Learning a classification of Mixed-Integer Quadratic Programming problems


Learning a classification of Mixed-Integer Quadratic Programming problems
CERMICS 2018, June 29, 2018, Fréjus
Pierre Bonami (CPLEX Optimization, IBM Spain)
Andrea Lodi, Giulia Zarpellon (Polytechnique Montréal, CERC Data Science for real-time Decision Making)

Table of contents
1. About Mixed-Integer Quadratic Programming problems
2. Data and experiments
3. Ongoing questions

About Mixed-Integer Quadratic Programming problems

Mixed-Integer Quadratic Programming problems

We consider Mixed-Integer Quadratic Programming (MIQP) problems of the form

    min  (1/2) x^T Q x + c^T x
    s.t. Ax = b
         x_i ∈ {0, 1}  for i ∈ I
         l ≤ x ≤ u                          (1)

- Modeling of practical applications (e.g., portfolio optimization)
- First extension of linear algorithms into nonlinear ones

We say an MIQP is convex (resp. nonconvex) if and only if the matrix Q is positive semi-definite, Q ⪰ 0 (resp. indefinite, Q ⋡ 0). The IBM CPLEX solver can solve both convex and nonconvex MIQPs to proven optimality.
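As a small illustration (not part of the slides), the convex/nonconvex distinction can be checked numerically from the spectrum of Q; a minimal Python sketch, assuming Q is given as a dense numpy array:

```python
import numpy as np

def is_convex_miqp(Q, tol=1e-9):
    """Classify the quadratic objective (1/2) x^T Q x + c^T x as convex or
    nonconvex by inspecting the smallest eigenvalue of the symmetric matrix Q."""
    Q = 0.5 * (Q + Q.T)                        # symmetrize against numerical asymmetry
    return np.linalg.eigvalsh(Q).min() >= -tol  # Q positive semi-definite (up to tol)

# Example: Q with eigenvalues 5 and -1, hence a nonconvex MIQP
Q = np.array([[2.0, 3.0],
              [3.0, 2.0]])
print(is_convex_miqp(Q))                        # False
```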

Solving MIQPs with CPLEX

    Convex 0-1       NL: NLP B&B                  L: linearize + MILP B&B
    Convex mixed     NL: NLP B&B                  L: linearize + MILP B&B, linearize + NLP B&B
    Nonconvex 0-1    NL: convexify + NLP B&B      L: linearize + MILP B&B
    Nonconvex mixed  convexification is a relaxation -> spatial B&B

Convexification: augment the diagonal of Q, using x_i = x_i^2 for x_i ∈ {0, 1}:
    x^T Q x  ->  x^T (Q + ρ I_n) x − ρ e^T x,   where Q + ρ I_n ⪰ 0 for some ρ > 0

Linearization: replace each product q_ij x_i x_j, where x_i ∈ {0, 1} and l_j ≤ x_j ≤ u_j, with a new variable y_ij and McCormick inequalities. The linearization is always full in the 0-1 case (it may not be for mixed MIQPs).
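A minimal numpy sketch of the two reformulations above, for notation only (it mirrors the formulas, not CPLEX's internal implementation):

```python
import numpy as np

def convexification_rho(Q, eps=1e-8):
    """Smallest diagonal shift rho (plus a margin) such that Q + rho*I_n is PSD;
    valid for 0-1 variables because x_i = x_i**2, so the term rho*e^T x compensates the shift."""
    lambda_min = np.linalg.eigvalsh(0.5 * (Q + Q.T)).min()
    return max(0.0, -lambda_min) + eps

def mccormick(i, j, l_j, u_j):
    """McCormick inequalities for y_ij = x_i * x_j with x_i in {0,1} and
    l_j <= x_j <= u_j, returned as readable strings."""
    return [
        f"y[{i},{j}] >= {l_j} * x[{i}]",
        f"y[{i},{j}] <= {u_j} * x[{i}]",
        f"y[{i},{j}] >= x[{j}] + {u_j} * x[{i}] - {u_j}",
        f"y[{i},{j}] <= x[{j}] + {l_j} * x[{i}] - {l_j}",
    ]
```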

Linearize vs. not linearize

The linearization approach seems beneficial also for the convex case, but is linearizing always the best choice?

Example [1]: convex 0-1 MIQP, n = 200 (comparison of Linearize vs. Not linearize runtimes)

"[...] when one looks at a broader variety of test problems the decision to linearize (vs. not linearize) does not appear so clear-cut." [2]

[1] Fourer R. Quadratic Optimization Mysteries, Part 1: Two Versions.
[2] Fourer R. Quadratic Optimization Mysteries, Part 2: Two Formulations. http://bob4er.blogspot.com/2015/03/quadratic-optimization-mysteries-part-2.html

Linearize vs. not linearize - Goal

Goal: exploit ML predictive machinery to understand whether it is favorable to linearize the quadratic part of an MIQP or not, i.e., learn an offline classifier predicting the most suited resolution approach within the CPLEX framework, in an instance-specific way.

We restrict to three types of problems: 0-1 convex, mixed convex, 0-1 nonconvex.

The parameter qtolin controls the linearization switch:
    L    on    CPLEX linearizes (all?) quadratic terms
    NL   off   CPLEX does not linearize quadratic terms
    DEF  auto  let CPLEX decide (default)
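For concreteness, a hedged sketch of toggling this switch from the CPLEX Python API; the parameter path parameters.preprocessing.qtolin and the value encoding (-1 automatic, 0 off, 1 on) are assumptions based on the CPXPARAM_Preprocessing_QToLin parameter, not something stated in the slides:

```python
import cplex

def solve_with_qtolin(model_path, qtolin=-1):
    """Solve an MIQP with an explicit linearization choice.
    qtolin: -1 = automatic (DEF), 0 = do not linearize (NL), 1 = linearize (L)."""
    model = cplex.Cplex(model_path)
    model.parameters.preprocessing.qtolin.set(qtolin)   # assumed parameter path
    model.solve()
    return model.solution.get_objective_value(), model.solution.get_status_string()
```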

Data and experiments

Steps to apply supervised learning

Dataset generation: a generator of MIQPs, spanning various parameters, with Q = sprandsym(size, density, eigenvalues).

Features design:
- Static features (21): mathematical characteristics (variables, constraints, objective, spectrum, ...)
- Dynamic features (2): early behavior in the optimization process (bounds and times at the root node)

Labeling procedure: labels in {L, NL, T} (tie cases are considered); runs with a 1h time limit and 5 seeds; solvability and consistency checks; runtimes are compared to assign a label (a minimal labeling sketch follows below).

Result: {(x_k, y_k)}, k = 1..N, where x_k ∈ R^d and y_k ∈ {L, NL, T}, for N MIQPs.
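The slides do not spell out the tie criterion; the following is a minimal sketch of one plausible labeling rule, where the relative threshold tie_tol is a hypothetical parameter:

```python
def label_instance(time_L, time_NL, tie_tol=0.1):
    """Assign a label from the (seed-averaged) runtimes of the L and NL runs.
    tie_tol is a hypothetical relative-gap threshold for declaring a tie."""
    faster, slower = min(time_L, time_NL), max(time_L, time_NL)
    if slower - faster <= tie_tol * slower:
        return "T"                              # runtimes too close to call
    return "L" if time_L < time_NL else "NL"
```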

Dataset D (nutshell) analysis

2300 instances, n ∈ {25, 50, ..., 200}, density d ∈ {0.2, 0.4, ..., 1}

Multiclass classifiers: SVM and Decision-Tree-based methods (Random Forests (RF), Extremely Randomized Trees (EXT), Gradient Tree Boosting (GTB)); overfitting is avoided with ML best practices. Tool: scikit-learn library (a minimal training sketch follows below).
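A minimal scikit-learn sketch of this kind of multiclass pipeline; the file names, split size, and hyperparameters are illustrative assumptions, not the authors' settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.metrics import classification_report

# X: one row of d features per MIQP instance; y: labels in {"L", "NL", "T"}
X = np.load("features.npy")                       # hypothetical file
y = np.load("labels.npy", allow_pickle=True)      # hypothetical file
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

classifiers = {
    "SVM": SVC(kernel="rbf", C=1.0),
    "RF":  RandomForestClassifier(n_estimators=500, random_state=0),
    "EXT": ExtraTreesClassifier(n_estimators=500, random_state=0),
    "GTB": GradientBoostingClassifier(random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    print(name)
    print(classification_report(y_te, clf.predict(X_te)))  # precision, recall, F1 per class
```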

Learning results on D_test

Classifiers perform well with respect to traditional classification measures:

    D_test - Multiclass - All features
                SVM    RF     EXT    GTB
    Accuracy    0.85   0.89   0.84   0.87
    Precision   0.82   0.85   0.81   0.85
    Recall      0.85   0.89   0.84   0.87
    F1 score    0.83   0.87   0.82   0.86

A major difficulty is posed by the T class, (almost) always misclassified.

Binary setting: removing all tie cases improves performance. How relevant are ties with respect to the question L vs. NL?

Hints of feature importance

Ensemble methods based on Decision Trees provide an importance score for each feature. The top-scoring ones are the dynamic features and those about eigenvalues:
- (dyn.) Difference of the lower bounds for L and NL at the root node
- (dyn.) Difference of the resolution times of the root node, for L and NL
- Value of the smallest nonzero eigenvalue
- Spectral norm of Q, i.e., ||Q|| = max_i |λ_i|
- ...

Static-features setting: removing the dynamic features slightly deteriorates performance. How does the prediction change without information at the root node? (A feature-importance sketch follows below.)
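Continuing the illustrative scikit-learn sketch above (feature_names is an assumed list of the 23 feature labels), tree ensembles expose these scores directly:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
ranking = np.argsort(rf.feature_importances_)[::-1]     # most important first
for idx in ranking[:5]:
    print(f"{feature_names[idx]:<40s} {rf.feature_importances_[idx]:.3f}")
```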

Measure the optimization gain

Need: evaluate the classifiers' performance in optimization terms, and quantify the gain with respect to the CPLEX default strategy.

- Run each test example once more, for all configurations of qtolin (on/L, off/NL, DEF), and collect the resolution times.
- For each classifier clf, and for DEF, build a times vector t_clf: for every test example, select the runtime corresponding to its label in {L, NL, T}, as predicted by clf (the L and NL times are averaged when T is predicted).
- Build also t_best (t_worst), containing the times corresponding to the correct (wrong) labels.

Complementary optimization measures

Using the times vectors t_clf, define:
- σ_clf : sum of predicted runtimes, i.e., the sum over the times in t_clf
- Nσ_clf ∈ [0, 1]: normalized time score, the shifted geometric mean of the times in t_clf, normalized between the best and worst cases

                     SVM    RF     EXT    GTB    DEF
    σ_DEF / σ_clf    3.88   4.40   4.04   4.26
    Nσ_clf           0.98   0.99   0.98   0.99   0.42

DEF could take up to 4x more time to run the MIQPs of D_test, compared to a trained classifier. The measures are better for classifiers hitting the time limit less frequently (and both L and NL reach the time limit multiple times!). (A sketch of these measures follows below.)
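A minimal sketch of these two measures; the shift value and the exact normalization between best and worst cases are plausible choices, not the ones stated in the talk:

```python
import numpy as np

def shifted_geometric_mean(times, shift=10.0):
    """Shifted geometric mean of runtimes, a standard aggregate in MIP benchmarking."""
    t = np.asarray(times, dtype=float)
    return np.exp(np.mean(np.log(t + shift))) - shift

def sigma(t_clf):
    """Sum of predicted runtimes."""
    return float(np.sum(t_clf))

def normalized_time_score(t_clf, t_best, t_worst, shift=10.0):
    """Score in [0, 1]: 1 when the predictions match the best labels, 0 at the worst."""
    sgm = lambda t: shifted_geometric_mean(t, shift)
    return (sgm(t_worst) - sgm(t_clf)) / (sgm(t_worst) - sgm(t_best))
```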

Ongoing questions

What about other datasets?

- Selection from QPLIB: 24 instances
- Part of the CPLEX internal testbed: 175 instances, used as a new test set C_test for the classifiers trained on the synthetic data; again unbalanced, but with a majority of ties and few NL

Results on C_test

C_test is dominated by a few structured combinatorial instances, with a very different distribution of features. All classifiers perform very poorly in terms of classification measures (most often T is predicted as NL), but... performance is not bad in optimization terms:

                     SVM    RF     EXT    GTB    DEF
    σ_DEF / σ_clf    0.48   0.53   0.71   0.42
    Nσ_clf           0.75   0.90   0.91   0.74   0.96

Given the high presence of ties, the runtimes for L and NL are most often comparable, so the loss in performance is not dramatic.

Why those predictions?

The bound at the root node seems to decide the label! Convexification and linearization clearly affect
- formulation size
- formulation strength
- implementation efficacy
- ...

... each problem type might have its own decision function for the question L vs. NL.

More experiments to come:
- Employ a larger and heterogeneous dataset
- Go beyond the preliminary features evaluation
- Define a custom loss/scoring function (a sketch of one possible such function follows below)
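One possible direction for the last point, purely as an illustration (not the authors' design): score a classifier by the runtime regret of its predictions rather than by classification accuracy. Here, runtimes is an assumed mapping from instance index to the measured L/NL times:

```python
import numpy as np

def runtime_regret(indices, y_pred, runtimes):
    """Total extra runtime incurred by the predicted labels over the best labels.
    runtimes[k] is a dict {"L": t_L, "NL": t_NL} for test instance k."""
    regret = 0.0
    for k, pred in zip(indices, y_pred):
        t = runtimes[k]
        chosen = np.mean(list(t.values())) if pred == "T" else t[pred]
        regret += chosen - min(t.values())
    return regret   # to be minimized; negate it to use as a scikit-learn score
```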

Thanks! Questions?