Ranking Clustered Data with Pairwise Comparisons
Alisa Maas

1. INTRODUCTION

1.1 Background

Machine learning often relies heavily on being able to rank the relative fitness of instances in a large set of data, but in many settings the only reasonable way to acquire information on the ranking is through making pairwise comparisons between the instances. Oftentimes, directly comparing two data points requires human intervention, which can be expensive in terms of time and other resources. Work by Jamieson and Nowak [3] mitigates this problem by providing an algorithm that minimizes the number of pairwise comparisons necessary to rank a data set, assuming that the set follows a particular structure. In particular, they assume that the set can be embedded in a Euclidean space and that each point's fitness is inversely proportional to its distance from an unknown reference point. The algorithm presented in that paper achieves a full ranking of the sample space with Θ(d log n) comparisons in the average case, where d is the dimensionality of the Euclidean space and n is the number of samples. This is a nontrivial improvement over general-purpose comparison-based sorting algorithms, which require Ω(n log n) pairwise comparisons even in the average case. The work also presents a version of the algorithm for the noisy case, where each query to the comparison oracle has some probability of returning an incorrect result (and keeps returning the same result on further queries). This version returns a full ranking of some subset of the samples that is consistent with all query responses and uses Θ(d log² n) queries in the average case.

1.2 Contribution

We generalize Jamieson and Nowak's algorithm to perform on data sets whose rankings follow a more general form.
Specifically, we expect the data to be partitioned into k clusters, where each cluster has its own unknown reference point, and the distance from each data point to its corresponding reference point determines its fitness. We demonstrate that this extended algorithm achieves Θ(kd log n + n log k) comparisons in the noiseless case, and additionally that it performs robustly in both the noiseless and noisy cases, on points drawn uniformly from a hypercube and on real-world product-review data, respectively.

2. APPROACH

Since we expect clustered data, our input consists not just of a set of points, but of a mapping from point to cluster as well. As in Jamieson and Nowak's algorithm, we do not assume the reference point will be provided with the rest of the input. Our algorithm is fully general, and when run with only a single cluster it performs comparably to Jamieson and Nowak's algorithm. In contrast, Jamieson and Nowak's algorithm relies on the assumption that all of the points are ranked by their distance to a single reference point, and it is plain to see that their algorithm does not perform correctly when samples are ranked by different reference points.

2.1 Noiseless Case

In the noiseless case, we sort each set of clustered samples independently, using a modified version of Jamieson and Nowak's algorithm (we work entirely in the primal formulation, while they work in the dual). Jamieson and Nowak use linear programming to impute data about pairwise comparisons that haven't yet been performed, given the additional information that all the samples are sorted by their relative distance to an unknown reference point. A direct comparison is only ever made when a point's comparison data is ambiguous. Comparison data between x and y is ambiguous if the reference point could be closer to x or could be closer to y, taking into account all constraints on its location imposed by the incomplete ordering obtained so far.
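As a concrete illustration of this ambiguity test, here is a minimal sketch in one dimension (an assumption of ours for illustration; the actual algorithm solves a linear program in d dimensions). In 1D the feasible region for the reference point is an interval, a comparison between x and y hinges on which side of their midpoint the reference point lies, and an oracle query is paid for only when the interval straddles that midpoint:

```python
import functools

def rank_cluster(points, r, queries):
    """Sort `points` by distance to the hidden reference point r, querying
    the comparison oracle only when the feasible region for r straddles the
    midpoint of a pair. `queries` is a one-element list used as a counter."""
    lo, hi = float("-inf"), float("inf")  # feasible interval for r

    def cmp(x, y):
        nonlocal lo, hi
        if x == y:
            return 0
        m = (x + y) / 2.0
        a, b = (x, y) if x < y else (y, x)  # a < b, so a is closer iff r < m
        if hi <= m:        # whole feasible region left of midpoint: impute
            closer = a
        elif lo >= m:      # whole feasible region right of midpoint: impute
            closer = b
        else:              # ambiguous: pay for one oracle query
            queries[0] += 1
            closer = x if abs(x - r) < abs(y - r) else y
            if closer == a:
                hi = min(hi, m)  # oracle implies r < m
            else:
                lo = max(lo, m)  # oracle implies r > m
        return -1 if closer == x else 1

    return sorted(points, key=functools.cmp_to_key(cmp))
```

Run on a handful of points, most comparisons are imputed from the shrinking interval rather than queried, which is the source of the savings over ordinary comparison sorting.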
This result is very closely tied to the assumption that all the points are ranked by their distance to a single reference point, hence it is clear that their results do not generalize well if the samples are ranked relative to different reference points. Jamieson and Nowak use the number of possible orderings of the data to prove theoretical upper bounds on the number of comparisons they require. Our implementation of Jamieson and Nowak's algorithm can be used in conjunction with any sorting algorithm, but we chose to use Python's timsort implementation. After running Jamieson and Nowak's algorithm, we have k lists, each internally sorted. We merge these sorted clusters pairwise, in the manner of the merge step of mergesort, incurring additional overhead of Θ(n log k). Thus for the noiseless case, we have a total of Θ(kd log n + n log k) comparisons. Compare this to the minimal number of comparisons that could be made by any sorting algorithm on data with this structure. A lower bound can be obtained by considering the minimal number of pairwise comparisons needed to distinguish the maximal number of possible orderings, which occurs when the points are split evenly across the clusters. Thus no sorting algorithm can do better than

max(2dk log(n/k), n log k),

which is derived by considering sorting each individual cluster with no overhead for merging. As the number of clusters approaches the number of samples, the lower bound approaches Ω(n log n). In the case that k = 1, this is exactly the order of comparisons required by Jamieson and Nowak's algorithm, Θ(d log n).

2.2 Noisy Case

In the noisy case, we also sort each set of clustered points using Jamieson and Nowak's algorithm. Here, rather than accepting any pairwise comparison as correct, Jamieson and Nowak assume there may be some noise in the comparisons. Thus, they construct a voting set from previously sorted pairs to determine where each element belongs. This is most intuitive to implement with an insertion sort, though other sorting algorithms could be used. The size of the voting set is a parameter, R, which can be tuned based on the expected error in the data set. If there is a plurality decision among the voting set, we abide by its decision. Otherwise, we throw out the element we are trying to insert, as its data is inconsistent with earlier data obtained. If we throw out a sample, we will never use it in a voting set, and it will not be part of the final ranking. If we cannot construct a voting set of size R for a given sample, we also throw it out. This means that our output is neither guaranteed to be perfectly sorted nor guaranteed to include all of the original elements, but in practice we find that nearly all elements are mostly sorted. Once we have the sorted clusters, we must merge them. Unlike before, however, we cannot trust each individual comparison within the merge. Thus, we construct another voting set. If we are merging elements x, y from clusters A, B, and comparing x and y yields the result x < y, we compare x to a voting set of neighbors of y in B, where y is ranked lower than each element in the voting set. Symmetrically, if we find x > y, we construct a voting set for y from A.
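To make this merge step concrete, here is a hedged sketch of the voting-set merge (illustrative code of ours, not the exact implementation; `oracle(u, v)` is assumed to answer whether u should precede v, possibly erring, and split votes are broken at random):

```python
import random

def voted_merge(a, b, oracle, R_prime=3, rng=random.Random(0)):
    """Merge two noisily sorted lists a and b. Each head-to-head comparison
    is checked against a voting set of the next R_prime neighbors in the
    losing element's list; a plurality confirms or overturns the claim."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        x, y = a[i], b[j]
        first = oracle(x, y)                        # claim: x precedes y
        # voting set: neighbors ranked after the losing element in its list
        voters = b[j + 1:j + 1 + R_prime] if first else a[i + 1:i + 1 + R_prime]
        winner = x if first else y
        votes = sum(oracle(winner, v) for v in voters)
        if not voters:
            take_x = first               # no voters left near list end: trust the claim
        elif 2 * votes > len(voters):
            take_x = first               # plurality confirms the claim
        elif 2 * votes < len(voters):
            take_x = not first           # plurality overturns it
        else:
            take_x = rng.random() < 0.5  # split vote: choose at random
        if take_x:
            out.append(x); i += 1
        else:
            out.append(y); j += 1
    return out + a[i:] + b[j:]
```

With a noiseless oracle this reduces to an ordinary merge, at the cost of extra confirming queries per step.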
Rather than throw out elements whose voting sets do not reach a plurality decision, we randomly select one of x, y to append to the end of the merged list. In practice, we find that merging in this manner does not substantially increase the number of misordered elements beyond the number already present from Jamieson and Nowak's algorithm. We separately tune the size of this voting set, which we call R′. As we increase R and R′, we find as a general trend that the number of comparisons rises and the number of inversions drops.

3. EXPERIMENTAL SETUP

3.1 Noiseless Case

We present noiseless experimental data obtained by performing experiments similar to [3]. Like Jamieson and Nowak, we use points on the unit hypercube, with d = 10, 20, ..., 100. To simulate clustered data, we pick k = 1, 2, 4 reference points uniformly at random and cluster each point with the closest reference point. Note that our algorithm is fully general, and could support randomly clustering the points as well. We graph the number of queries required to obtain a completely sorted set.

3.2 Noisy Case

Data Gathering. For the noisy case, we found that the data set used by Jamieson and Nowak was no longer available, so our results may not be directly comparable to theirs. In order to consider the case where k > 1, we needed a data set that was already clustered (possibly by machine) and ranked by humans. We elected to use Amazon reviews of several books. Specifically, we downloaded all the reviews denoted as positive and critical (as clustered by Amazon) for The Fault in Our Stars, A Game of Thrones (from A Song of Ice and Fire), and Life of Pi. We took the ranking of each review to be the proportion of readers who found the review helpful, as reported by Amazon. We found after some experimentation that the positive clusters were not useful for our purposes: positive reviews elicited fewer votes, so each cluster was much noisier.
Our motivation for using critical reviews across different products is to present a user in the market for a particular type of product with critical reviews ordered from most to least helpful, enabling a more informed decision. We adopted a bag-of-words representation for each review, removing common stop words from consideration. We also removed from consideration reviews that had fewer than 20 votes; below this threshold, we believe the data is much noisier and the helpfulness proportion serves as a poor representation of the ranking. Each cluster contained approximately 30 reviews, and all the review sets had similar distributions of helpfulness proportions. We now have a Euclidean representation of our data, and a hypothesis that its bag-of-words representation relates to its ranking, but our dimensionality far exceeds our number of samples: the bag-of-words representation yields points of dimension roughly 4000, while we wanted to consider approximately 30 samples per cluster. Recall that the complexity of our algorithm is Θ(kd log n + n log k); with d >> n, our algorithm is clearly infeasible. We used multidimensional scaling [1] to reduce the dimensionality to 10 and 15, and we compare the results in each case. Generally, we found that for d = 10 we used fewer comparisons, which is unsurprising given the complexity of our algorithm, but also had more out-of-order points. Thus the dimensionality of the bag-of-words representation appears to be a trade-off between number of comparisons and accuracy of ranking.

Sources of Noise. There are several sources of noise in our data set. Firstly, it is possible that the bag-of-words representation we used does not closely correspond to the actual helpfulness of the review in the manner that we expect. There may not be a single theoretically perfect review for a product, and even if there is, the utility of each review may not be perfectly represented by its distance from that review.
Humans have different tastes, and depending on which people voted on reviews, we may find that the ranking gathered by Amazon does not generalize well. However, we consider this to be unlikely, due to the high number of votes we required on a review. It is also possible that reviews are systematically voted up or down based not on the merit of their content, but because of the voters' preferences about the novel; we tried to select book reviews without this problem. Finally, it is also possible that Amazon's method of clustering reviews as positive or critical may be faulty, in which case some of the clusters could contain incorrect samples. In practice, our results indicate the definite presence of noise within our data set, though it is difficult to tell its precise source. However, our results are robust in the face of this noise, and performance is within an acceptable margin of error given the noisiness of the data.

Evaluation. In order to evaluate the noisy case, we determine the number of inversions produced by a run of our algorithm over a set of data, i.e., the number of pairs of elements that are improperly ordered. We calculate the maximum number of inversions possible on a partial ranking of the output's size, and consider the ratio of inversions found to maximum inversions possible a suitable metric for the sortedness of the list. For each pair of sets of critical reviews, we performed 20 sorts, randomly shuffling the reviews prior to each sort. We aggregate data on the number of comparisons in each sort, the proportion of inversions to maximum possible inversions, and the length of each partial order. In general, we found a trade-off between inversion proportion and number of pairwise comparisons: the more comparisons we made, the fewer inversions were produced. We also compared triples of sets of critical reviews as well as pairs; these results are discussed in more detail in Section 4. As expected, our number of inversions and number of comparisons are higher when merging three sets of items than when merging two.

4. RESULTS

4.1 Noiseless Case

In the noiseless case, we see in Figure 1 that the number of pairwise comparisons (queries) follows the trend we would expect. As we have more clusters, the overhead from the merge becomes more significant and approaches the cost of mergesort. This problem is exacerbated at higher dimensionality. This motivates our use of multidimensional scaling in the noisy case, as the original dimensionality of the data would be prohibitively high compared to the number of reviews we obtained.

4.2 Noisy Case

In the noisy case, we see some robustness data in Figure 2.
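The sortedness metric used in the evaluation can be sketched as follows (a simple quadratic counter of our own; the function and argument names are illustrative, and a mergesort-based counter would bring this to O(n log n)):

```python
def inversion_proportion(output, true_rank):
    """Fraction of improperly ordered pairs in `output`, relative to the
    maximum possible number of inversions, n*(n-1)/2, for a list of its
    size. `true_rank` maps each element to its true ranking position."""
    n = len(output)
    if n < 2:
        return 0.0
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if true_rank[output[i]] > true_rank[output[j]])
    return inversions / (n * (n - 1) / 2)
```

Because the noisy algorithm may discard elements, the denominator here uses the size of the partial ranking actually produced, mirroring the evaluation described above.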
In this case, there is not a clear baseline number of comparisons to beat, as producing a partial order requires assuming that some comparison data may be erroneous. However, comparing the number of queries actually made to the number of all possible pairwise comparisons, we find that we consistently make under half of all possible pairwise comparisons. With 10-dimensional data, the number of queries is significantly lower overall. This is due to properties of Jamieson and Nowak's algorithm: as the number of dimensions increases, the number of constraints that must be built into the linear program before any data can be imputed increases, requiring more and more comparisons before any benefit is obtained. The proportion of inversions observed is consistently under 0.4, which is significantly better than random sorting. There are only very small differences between d = 10 and d = 15, though generally d = 15 has a slightly lower proportion of inversions. The proportion of sorted elements is visibly larger for d = 15, but both d = 10 and d = 15 obtain data that is at least 80% sorted, much better than randomly permuting the data. We also attempted to fine-tune good values for R and R′ on our data set. Using The Fault in Our Stars and Game of Thrones, we found optimal values around R = 7, R′ = 5. This appears to provide a good balance between a low proportion of inversions (and thus a highly sorted output) and a small number of comparisons. For these values of R and R′, we had an average inversion proportion of 0.33 for d = 15 and 0.38 for d = 10, over 20 trials each. Using all three sets of reviews, with R = 10, R′ = 5 over 20 trials, produced a higher inversion proportion and more comparisons.
This is clearly greater than the number of comparisons used with only two sets of reviews, but there is also more data to sort overall (about 30% more), so the expected number of comparisons for any sort should be higher.

5. CONCLUSION

We provide an extension to Jamieson and Nowak's algorithm that sorts data whose fitness corresponds to distance to one of k reference points. On noiseless data, we obtain Θ(kd log n + n log k) comparisons. In the noisy case, we provide ways of detecting and dealing with the noise inherent in most data sets while still sorting most of the data. On real-world data, we obtain results indicating that our algorithm is competitive with traditional sorting approaches and with Jamieson and Nowak's algorithm, and may offer some relief when pairwise comparisons are expensive.

5.1 Future Work

We would like to extend this work in several concrete ways. Firstly, there may be better approaches to merging the k clusters of data that reduce the overhead of the sort significantly. More sophisticated merging approaches, such as the one provided by Hwang and Lin [2], may further reduce the number of comparisons required. Secondly, we would like to experiment with different methods of merging the noisy clustered data and compare them to find the best approach. Our current implementation never removes any elements beyond the ones removed by Jamieson and Nowak's algorithm. We may see better results if, instead of randomly choosing which element to add when the voting subset cannot decide, we removed one or both of the troublesome elements. Additionally, we might be able to reduce the number of comparisons if we took the votes cast by the voting subset into account. If the voting subset agrees that x ≻ y, then some of that subset must belong before x. In that case, we may be able to move some of the voting subset into its proper place in the merged array without doing further comparisons.
Thirdly, it may be possible to compare optimal values of R and R′, the proportion of inversions, and the number of comparisons to estimate the noise in a particular data set. Since R and R′ are tuned based on the proportion of noise in the data set, in theory it may be possible to work backward to determine the amount of noise present.

6. ACKNOWLEDGMENTS

This work was done in collaboration with Kevin Kowalski.
7. REFERENCES

[1] M. A. A. Cox and T. F. Cox. Multidimensional scaling. In Handbook of Data Visualization. Springer-Verlag.
[2] F. K. Hwang and S. Lin. Optimal merging of 2 elements with n elements. Acta Informatica, 1(2).
[3] Kevin G. Jamieson and Robert D. Nowak. Active ranking using pairwise comparisons. CoRR, 2011.
Figure 1: The number of comparisons requested, by dimension, for different numbers of clusters in the noiseless case. The dashed line at the top represents the baseline number of queries required to sort the sets using mergesort.

Figure 2: Data on the noisy case: the proportion of queries made relative to all possible pairwise comparisons, the proportion of inversions relative to the maximum possible, and the proportion of sorted elements.
More informationAdvanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret
Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely
More informationTopics in Machine Learning
Topics in Machine Learning Gilad Lerman School of Mathematics University of Minnesota Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng Machine Learning - Motivation Arthur
More informationIntro to Algorithms. Professor Kevin Gold
Intro to Algorithms Professor Kevin Gold What is an Algorithm? An algorithm is a procedure for producing outputs from inputs. A chocolate chip cookie recipe technically qualifies. An algorithm taught in
More informationLecture 5: Duality Theory
Lecture 5: Duality Theory Rajat Mittal IIT Kanpur The objective of this lecture note will be to learn duality theory of linear programming. We are planning to answer following questions. What are hyperplane
More informationChapter 1 Divide and Conquer Algorithm Theory WS 2014/15 Fabian Kuhn
Chapter 1 Divide and Conquer Algorithm Theory WS 2014/15 Fabian Kuhn Divide And Conquer Principle Important algorithm design method Examples from Informatik 2: Sorting: Mergesort, Quicksort Binary search
More informationSentiment analysis under temporal shift
Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely
More informationExtremal Graph Theory: Turán s Theorem
Bridgewater State University Virtual Commons - Bridgewater State University Honors Program Theses and Projects Undergraduate Honors Program 5-9-07 Extremal Graph Theory: Turán s Theorem Vincent Vascimini
More information3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.
3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction
More informationApplication of Support Vector Machine Algorithm in Spam Filtering
Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification
More informationData Mining and Data Warehousing Classification-Lazy Learners
Motivation Data Mining and Data Warehousing Classification-Lazy Learners Lazy Learners are the most intuitive type of learners and are used in many practical scenarios. The reason of their popularity is
More informationCISC 4631 Data Mining
CISC 4631 Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F.
More informationIn the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,
1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to
More informationAccelerometer Gesture Recognition
Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate
More informationUNLABELED SENSING: RECONSTRUCTION ALGORITHM AND THEORETICAL GUARANTEES
UNLABELED SENSING: RECONSTRUCTION ALGORITHM AND THEORETICAL GUARANTEES Golnoosh Elhami, Adam Scholefield, Benjamín Béjar Haro and Martin Vetterli School of Computer and Communication Sciences École Polytechnique
More informationCSC 447: Parallel Programming for Multi- Core and Cluster Systems
CSC 447: Parallel Programming for Multi- Core and Cluster Systems Parallel Sorting Algorithms Instructor: Haidar M. Harmanani Spring 2016 Topic Overview Issues in Sorting on Parallel Computers Sorting
More informationFrom acute sets to centrally symmetric 2-neighborly polytopes
From acute sets to centrally symmetric -neighborly polytopes Isabella Novik Department of Mathematics University of Washington Seattle, WA 98195-4350, USA novik@math.washington.edu May 1, 018 Abstract
More informationDistributed minimum spanning tree problem
Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with
More informationComp Online Algorithms
Comp 7720 - Online Algorithms Notes 4: Bin Packing Shahin Kamalli University of Manitoba - Fall 208 December, 208 Introduction Bin packing is one of the fundamental problems in theory of computer science.
More information26 The closest pair problem
The closest pair problem 1 26 The closest pair problem Sweep algorithms solve many kinds of proximity problems efficiently. We present a simple sweep that solves the two-dimensional closest pair problem
More informationCHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM
96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays
More informationLecture 8 Parallel Algorithms II
Lecture 8 Parallel Algorithms II Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Original slides from Introduction to Parallel
More informationConstruction of Minimum-Weight Spanners Mikkel Sigurd Martin Zachariasen
Construction of Minimum-Weight Spanners Mikkel Sigurd Martin Zachariasen University of Copenhagen Outline Motivation and Background Minimum-Weight Spanner Problem Greedy Spanner Algorithm Exact Algorithm:
More informationParallel and Sequential Data Structures and Algorithms Lecture (Spring 2012) Lecture 16 Treaps; Augmented BSTs
Lecture 16 Treaps; Augmented BSTs Parallel and Sequential Data Structures and Algorithms, 15-210 (Spring 2012) Lectured by Margaret Reid-Miller 8 March 2012 Today: - More on Treaps - Ordered Sets and Tables
More informationHALF&HALF BAGGING AND HARD BOUNDARY POINTS. Leo Breiman Statistics Department University of California Berkeley, CA
1 HALF&HALF BAGGING AND HARD BOUNDARY POINTS Leo Breiman Statistics Department University of California Berkeley, CA 94720 leo@stat.berkeley.edu Technical Report 534 Statistics Department September 1998
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationThe Curse of Dimensionality
The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more
More informationUsing Statistics for Computing Joins with MapReduce
Using Statistics for Computing Joins with MapReduce Theresa Csar 1, Reinhard Pichler 1, Emanuel Sallinger 1, and Vadim Savenkov 2 1 Vienna University of Technology {csar, pichler, sallinger}@dbaituwienacat
More informationSpectral Clustering and Community Detection in Labeled Graphs
Spectral Clustering and Community Detection in Labeled Graphs Brandon Fain, Stavros Sintos, Nisarg Raval Machine Learning (CompSci 571D / STA 561D) December 7, 2015 {btfain, nisarg, ssintos} at cs.duke.edu
More informationRandom projection for non-gaussian mixture models
Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,
More informationSolution for Homework set 3
TTIC 300 and CMSC 37000 Algorithms Winter 07 Solution for Homework set 3 Question (0 points) We are given a directed graph G = (V, E), with two special vertices s and t, and non-negative integral capacities
More informationDesign and Analysis of Algorithms
Design and Analysis of Algorithms CSE 5311 Lecture 8 Sorting in Linear Time Junzhou Huang, Ph.D. Department of Computer Science and Engineering CSE5311 Design and Analysis of Algorithms 1 Sorting So Far
More informationSorting Algorithms. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
Sorting Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview Issues in Sorting on Parallel
More informationIterative Voting Rules
Noname manuscript No. (will be inserted by the editor) Iterative Voting Rules Meir Kalech 1, Sarit Kraus 2, Gal A. Kaminka 2, Claudia V. Goldman 3 1 Information Systems Engineering, Ben-Gurion University,
More informationOne-Point Geometric Crossover
One-Point Geometric Crossover Alberto Moraglio School of Computing and Center for Reasoning, University of Kent, Canterbury, UK A.Moraglio@kent.ac.uk Abstract. Uniform crossover for binary strings has
More informationClustering. RNA-seq: What is it good for? Finding Similarly Expressed Genes. Data... And Lots of It!
RNA-seq: What is it good for? Clustering High-throughput RNA sequencing experiments (RNA-seq) offer the ability to measure simultaneously the expression level of thousands of genes in a single experiment!
More informationMatching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition.
18.433 Combinatorial Optimization Matching Algorithms September 9,14,16 Lecturer: Santosh Vempala Given a graph G = (V, E), a matching M is a set of edges with the property that no two of the edges have
More informationBig Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1
Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that
More informationCreating a Classifier for a Focused Web Crawler
Creating a Classifier for a Focused Web Crawler Nathan Moeller December 16, 2015 1 Abstract With the increasing size of the web, it can be hard to find high quality content with traditional search engines.
More informationNonparametric Methods Recap
Nonparametric Methods Recap Aarti Singh Machine Learning 10-701/15-781 Oct 4, 2010 Nonparametric Methods Kernel Density estimate (also Histogram) Weighted frequency Classification - K-NN Classifier Majority
More informationComputer Experiments. Designs
Computer Experiments Designs Differences between physical and computer Recall experiments 1. The code is deterministic. There is no random error (measurement error). As a result, no replication is needed.
More informationHashing. Hashing Procedures
Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements
More informationHEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY
Proceedings of the 1998 Winter Simulation Conference D.J. Medeiros, E.F. Watson, J.S. Carson and M.S. Manivannan, eds. HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A
More informationarxiv: v1 [cs.ma] 8 May 2018
Ordinal Approximation for Social Choice, Matching, and Facility Location Problems given Candidate Positions Elliot Anshelevich and Wennan Zhu arxiv:1805.03103v1 [cs.ma] 8 May 2018 May 9, 2018 Abstract
More informationMotivation. Technical Background
Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering
More informationParallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. November Parallel Sorting
Parallel Systems Course: Chapter VIII Sorting Algorithms Kumar Chapter 9 Jan Lemeire ETRO Dept. November 2014 Overview 1. Parallel sort distributed memory 2. Parallel sort shared memory 3. Sorting Networks
More information