Approximate Correlation Clustering using Same-Cluster Queries


Approximate Correlation Clustering using Same-Cluster Queries. CSE, IIT Delhi. LATIN Talk, April 19, 2018. [Joint work with Nir Ailon (Technion) and Anup Bhattacharya (IITD)]

Clustering. Clustering is the task of partitioning a given set of objects into clusters such that similar objects are in the same group (cluster) and dissimilar objects are in different groups.

Correlation Clustering. Correlation clustering: objects are represented as vertices in a complete graph with ± labeled edges. Edges labeled + denote similarity and those labeled − denote dissimilarity. The goal is to find a clustering of the vertices that maximizes agreements (MaxAgree) or minimizes disagreements (MinDisAgree).

Correlation Clustering. MaxAgree: Given a complete graph with ± labeled edges, find a clustering of the vertices that maximizes the objective function Φ = (number of + edges within clusters) + (number of − edges across clusters). MinDisAgree: Given a complete graph with ± labeled edges, find a clustering of the vertices that minimizes the objective function Ψ = (number of − edges within clusters) + (number of + edges across clusters). Figure: Φ = 12 and Ψ = 3.
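As a concrete check of the two objectives, a minimal Python sketch (the `labels` encoding and the function name are illustrative, not from the talk) that counts agreements Φ and disagreements Ψ for a given clustering:

```python
def phi_psi(labels, clustering):
    """Count agreements (phi, the MaxAgree objective) and disagreements
    (psi, the MinDisAgree objective) for a clustering of a +/- labeled graph.

    labels: dict mapping frozenset({u, v}) -> '+' or '-'
    clustering: dict mapping each vertex -> its cluster id
    """
    phi = psi = 0
    for pair, sign in labels.items():
        u, v = tuple(pair)
        same = clustering[u] == clustering[v]
        if (sign == '+') == same:
            phi += 1   # agreement: + edge inside a cluster, or - edge across
        else:
            psi += 1   # disagreement: - edge inside a cluster, or + edge across
    return phi, psi
```

On a complete graph every edge is either an agreement or a disagreement, so Φ + Ψ = n(n−1)/2: maximizing Φ and minimizing Ψ are the same problem exactly, though not with respect to approximation.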

Correlation Clustering. MaxAgree: Given a complete graph with ± labeled edges, find a clustering of the vertices that maximizes the objective function Φ = (number of + edges within clusters) + (number of − edges across clusters). NP-hard [BBC04]. There is a PTAS for the problem [BBC04]. MinDisAgree: Given a complete graph with ± labeled edges, find a clustering of the vertices that minimizes the objective function Ψ = (number of − edges within clusters) + (number of + edges across clusters). APX-hard [CGW05]. Constant-factor approximation algorithms are known [BBC04, CGW05].

Correlation Clustering. MaxAgree[k]: Given a complete graph with ± labeled edges and an integer k, find a clustering of the vertices into at most k clusters that maximizes the objective function Φ = (number of + edges within clusters) + (number of − edges across clusters). MinDisAgree[k]: Given a complete graph with ± labeled edges and an integer k, find a clustering of the vertices into at most k clusters that minimizes the objective function Ψ = (number of − edges within clusters) + (number of + edges across clusters). Figure: Φ = 12 and Ψ = 3 for k = 2.

Correlation Clustering. MaxAgree[k]: Given a complete graph with ± labeled edges and an integer k, find a clustering of the vertices into at most k clusters that maximizes Φ = (number of + edges within clusters) + (number of − edges across clusters). NP-hard for k ≥ 2 [SST04]. PTAS for any k (since there is a PTAS for MaxAgree). MinDisAgree[k]: Given a complete graph with ± labeled edges and an integer k, find a clustering of the vertices into at most k clusters that minimizes Ψ = (number of − edges within clusters) + (number of + edges across clusters). NP-hard for k ≥ 2 [SST04]. PTAS for constant k with running time n^(O(9^k / ε^2)) · log n [GG06].

k-means Clustering: Beyond Worst Case. Separating mixtures of Gaussians. Clustering under separation conditions in the context of k-means clustering. Clustering in a semi-supervised setting where the clustering algorithm is allowed to make queries during its execution.

Semi-Supervised Active Clustering (SSAC): Same-Cluster Queries. Beyond worst case: mixtures of Gaussians; clustering under separation; clustering in a semi-supervised setting where the algorithm may make queries during its execution. Semi-Supervised Active Clustering (SSAC) [AKBD16]: in the context of the k-means problem, the clustering algorithm is given the dataset X ⊆ R^d and an integer k (as in the classical setting), and it can additionally make same-cluster queries.

Semi-Supervised Active Clustering (SSAC): Same-Cluster Queries. SSAC framework: same-cluster queries for correlation clustering. Figure: SSAC framework: same-cluster queries.

Semi-Supervised Active Clustering (SSAC): Same-Cluster Queries. SSAC framework: same-cluster queries for correlation clustering. Figure: SSAC framework: same-cluster queries. A limited number of such queries (or some weaker version of them) may be feasible in certain settings, so understanding the power and limitations of this idea may open interesting future directions.
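A same-cluster query can be viewed as access to an oracle for a fixed but hidden target clustering. A minimal Python sketch of such an oracle (the class and attribute names are my own, not from the talk):

```python
class SameClusterOracle:
    """Illustrative same-cluster oracle for the SSAC framework: it holds a
    fixed target clustering (hidden from the algorithm), answers whether two
    objects belong to the same target cluster, and counts the queries made."""

    def __init__(self, target):
        self._target = target   # vertex -> cluster id, hidden from the algorithm
        self.queries = 0

    def same_cluster(self, u, v):
        self.queries += 1
        return self._target[u] == self._target[v]
```

An algorithm in the SSAC framework only interacts with `same_cluster`; the query counter is what the upper and lower bounds below measure.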

Semi-Supervised Active Clustering (SSAC): Known Results for k-means. Clearly, we can output the optimal clustering using O(n^2) same-cluster queries. Can we cluster using fewer queries? The following result is known for the SSAC setting in the context of the k-means problem. Theorem (informally stated, from [AKBD16]): There is a randomised algorithm that runs in time O(kn log n), makes O(k^2 log k + k log n) same-cluster queries, and returns the optimal k-means clustering for any dataset X ⊆ R^d that satisfies a certain separation guarantee.
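The O(n^2) bound can be met (and in fact improved to at most k queries per vertex) by keeping one representative per discovered cluster; a minimal sketch, where `same_cluster` stands for an assumed query interface:

```python
def recover_clustering(vertices, same_cluster):
    """Recover a hidden clustering exactly via same-cluster queries: keep one
    representative per cluster discovered so far, and query each new vertex
    against them. At most k queries per vertex, so O(nk) = O(n^2) in total."""
    reps = []          # one representative vertex per discovered cluster
    assignment = {}
    for v in vertices:
        for cid, rep in enumerate(reps):
            if same_cluster(rep, v):
                assignment[v] = cid
                break
        else:          # no representative matched: v opens a new cluster
            assignment[v] = len(reps)
            reps.append(v)
    return assignment
```

This recovers the target clustering exactly but makes Ω(nk) queries; the point of the results that follow is to get away with far fewer queries while settling for a (1 + ε)-approximation.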

Semi-Supervised Active Clustering (SSAC): Known Results for k-means. The following result is known for the SSAC setting in the context of the k-means problem. Theorem (informally stated, from [AKBD16]): There is a randomised algorithm that runs in time O(kn log n), makes O(k^2 log k + k log n) same-cluster queries, and returns the optimal k-means clustering for any dataset X ⊆ R^d that satisfies a certain separation guarantee. Ailon et al. [ABJK18] extend this result to the approximation setting while removing the separation condition: running time O(nd · poly(k/ε)), with poly(k/ε) same-cluster queries (independent of n). Question: can we obtain similar results for correlation clustering?

MinDisAgree[k] within SSAC. MinDisAgree[k]: Given a complete graph with ± labeled edges and an integer k, find a clustering of the vertices into at most k clusters that minimizes Ψ = (number of − edges within clusters) + (number of + edges across clusters). There is a (1 + ε)-approximation algorithm with running time n^(O(9^k / ε^2)) · log n [GG06]. Theorem (main result, upper bound): There is a randomised query algorithm that runs in time O(poly(k/ε) · n log n), makes O(poly(k/ε) · log n) same-cluster queries, and outputs a (1 + ε)-approximate solution for MinDisAgree[k].

MinDisAgree[k] within SSAC. There is a (1 + ε)-approximation algorithm with running time n^(O(9^k / ε^2)) · log n [GG06]. Theorem (main result, upper bound): There is a randomised query algorithm that runs in time O(poly(k/ε) · n log n), makes O(poly(k/ε) · log n) same-cluster queries, and outputs a (1 + ε)-approximate solution for MinDisAgree[k]. Theorem (main result, running-time lower bound): If the Exponential Time Hypothesis (ETH) holds, then there is a constant δ > 0 such that any (1 + δ)-approximation algorithm for MinDisAgree[k] requires time 2^(Ω(k / poly log k)).

MinDisAgree[k] within SSAC. There is a (1 + ε)-approximation algorithm with running time n^(O(9^k / ε^2)) · log n [GG06]. Theorem (main result, upper bound): There is a randomised query algorithm that runs in time O(poly(k/ε) · n log n), makes O(poly(k/ε) · log n) same-cluster queries, and outputs a (1 + ε)-approximate solution for MinDisAgree[k]. Theorem (main result, running-time lower bound): If the Exponential Time Hypothesis (ETH) holds, then there is a constant δ > 0 such that any (1 + δ)-approximation algorithm for MinDisAgree[k] requires time 2^(Ω(k / poly log k)). Theorem (main result, query lower bound): If the Exponential Time Hypothesis (ETH) holds, then there is a constant δ > 0 such that any (1 + δ)-approximation algorithm for MinDisAgree[k] within the SSAC framework that runs in polynomial time makes Ω(k / poly log k) same-cluster queries.

MinDisAgree[k] within SSAC. Theorem (main result, running-time lower bound): If the Exponential Time Hypothesis (ETH) holds, then there is a constant δ > 0 such that any (1 + δ)-approximation algorithm for MinDisAgree[k] requires time 2^(Ω(k / poly log k)). Chain of reductions for the lower bounds: ETH (via Dinur's PCP) → E3-SAT → NAE6-SAT → NAE3-SAT → Monotone NAE3-SAT → 2-colorability of 3-uniform bounded-degree hypergraphs → (reduction of [CGW05]) → MinDisAgree[k].

MinDisAgree[k] within SSAC. Theorem (main result, upper bound): There is a randomised query algorithm that runs in time O(poly(k/ε) · n log n), makes O(poly(k/ε) · log n) same-cluster queries, and outputs a (1 + ε)-approximate solution for MinDisAgree[k]. Main idea: a simple observation about the PTAS of Giotis and Guruswami [GG06].


Future Directions. Closing the gap between the query upper and lower bounds. The faulty-query setting.

References.
[ABJK18] Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar, Approximate clustering with same-cluster queries, Proceedings of the 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), 2018.
[AKBD16] Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David, Clustering with same-cluster queries, Advances in Neural Information Processing Systems, 2016, pp. 3216–3224.
[BBC04] Nikhil Bansal, Avrim Blum, and Shuchi Chawla, Correlation clustering, Machine Learning 56 (2004), no. 1–3, 89–113.
[CGW05] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth, Clustering with qualitative information, Journal of Computer and System Sciences 71 (2005), no. 3, 360–383.
[GG06] Ioannis Giotis and Venkatesan Guruswami, Correlation clustering with a fixed number of clusters, Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2006), SIAM, 2006, pp. 1167–1176.
[SST04] Ron Shamir, Roded Sharan, and Dekel Tsur, Cluster graph modification problems, Discrete Applied Mathematics 144 (2004), no. 1, 173–182.

Thank you