Learning to Project in Multi-Objective Binary Linear Programming

Alvaro Sierra-Altamiranda
Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, FL 33620 USA, amsierra@mail.usf.edu, http://www.eng.usf.edu/~amsierra/

Hadi Charkhgard
Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, FL 33620 USA, hcharkhgard@usf.edu, http://www.eng.usf.edu/~hcharkhgard/

Iman Dayarian
Culverhouse College of Business, The University of Alabama, Tuscaloosa, AL 35487 USA, idayarian@cba.ua.edu, https://culverhouse.ua.edu/news/directory/iman-dayarian/

Ali Eshragh
School of Mathematical and Physical Sciences, The University of Newcastle, Callaghan, NSW 2308 Australia, ali.eshragh@newcastle.edu.au, https://www.newcastle.edu.au/profile/ali-eshragh

Sorna Javadi
Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, FL 33620 USA, javadis@mail.usf.edu

In this paper, we investigate the possibility of improving the performance of multi-objective optimization solution approaches using machine learning techniques. Specifically, we focus on multi-objective binary linear programs and employ one of the most effective and recently developed criterion space search algorithms, the so-called KSA, during our study. This algorithm computes all nondominated points of a problem with p objectives by searching on a projected criterion space, i.e., a (p-1)-dimensional criterion space. We present an effective and fast learning approach to identify on which projected space the KSA should work. We also present several generic features/variables that can be used in machine learning techniques for identifying the best projected space. Finally, we present an effective bi-objective optimization-based heuristic for selecting the best subset of the features to overcome the issue of overfitting in learning. Through an extensive computational study over 2000 instances of tri-objective Knapsack and Assignment problems, we demonstrate that an improvement of up to 12% in time can be achieved by the proposed learning method compared to a random selection of the projected space.

Key words: multi-objective optimization, machine learning, binary linear program, criterion space search algorithm, learning to project

1. Introduction

Many real-life optimization problems involve multiple objective functions, and they can be stated as follows:

    min_{x ∈ X} { z_1(x), ..., z_p(x) },    (1)

where X ⊆ R^n represents the set of feasible solutions of the problem and z_1(x), ..., z_p(x) are p objective functions. Because the objectives of a multi-objective optimization problem are often in conflict, an ideal feasible solution that optimizes all objectives at the same time rarely exists in practice. Hence, when solving such a problem, the goal is often to generate some (if not all) efficient solutions, i.e., feasible solutions for which it is impossible to improve the value of one objective without making the value of some other objective worse.

The focus of this study is on Multi-Objective Binary Linear Programs (MOBLPs), i.e., multi-objective optimization problems in which all decision variables are binary and all objective functions and constraints are linear. In the last few years, significant advances have been made in the development of effective algorithms for solving MOBLPs; see, for instance, Boland et al. (2015a,b, 2016, 2017b), Dächert et al. (2012), Dächert and Klamroth (2015), Fattahi and Turkay (2017), Kirlik and Sayın (2014), Özpeynirci and Köksalan (2010), Lokman and Köksalan (2013), Özlen et al. (2013), Przybylski and Gandibleux (2017), Przybylski et al. (2010), Soylu and Yıldız (2016), and Vincent et al. (2013). Many of the recently developed algorithms fall into the category of criterion space search algorithms, i.e., those that work in the space of objective function values. Such algorithms are specifically designed to find all nondominated points of a multi-objective optimization problem, where the image of an efficient solution in the criterion space is referred to as a nondominated point. After computing each nondominated point, criterion space search algorithms remove the portion of the criterion space dominated by that point and search for not-yet-found nondominated points in the remaining space.

In general, to solve a multi-objective optimization problem, criterion space search algorithms solve a sequence of single-objective optimization problems. Specifically, when solving a problem with p objective functions, many criterion space search algorithms first attempt to transform the problem into a sequence of problems with (p-1) objectives (Boland et al. 2017b).

In other words, they attempt to compute all nondominated points by discovering their projections in a (p-1)-dimensional criterion space. Evidently, the same process can be applied recursively until a sequence of single-objective optimization problems is generated. For example, to solve each problem with (p-1) objectives, a sequence of problems with (p-2) objectives can be solved. Overall, there are at least two possible ways to project from a higher-dimensional criterion space (say, dimension p) to a criterion space with one less dimension (say, dimension p-1); a small illustrative sketch follows the two descriptions below.

Weighted Sum Projection: A typical approach used in the literature (see, for instance, Özlen and Azizoğlu 2009) is to select one of the objective functions of the higher-dimensional problem (for example z_1(x)) and remove it after adding it, with some strictly positive weight, to each of the other objective functions. In this case, by imposing different bounds on z_1(x) and/or on the values of the other objective functions, a sequence of optimization problems with p-1 objectives is generated.

Lexicographical Projection: We first note that a lexicographical optimization problem is a two-stage optimization problem that attempts to optimize a set of objectives, the so-called secondary objectives, over the set of solutions that are optimal for another objective, the so-called primary objective. The first stage of a lexicographical optimization problem is a single-objective optimization problem, as it optimizes the primary objective. The second stage, however, can be a multi-objective optimization problem, as it optimizes the secondary objectives. Based on this definition, another typical approach (see, for instance, Özlen et al. (2013)) is to select one of the objective functions of the higher-dimensional problem (for example z_1(x)) and simply remove it. In this case, by imposing different bounds on z_1(x) and/or on the values of the other objective functions, a sequence of lexicographical optimization problems should be solved in which z_1(x) is the primary objective and the remaining p-1 objectives are the secondary objectives.
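As a small illustration (ours, not part of the original algorithms), the following Python sketch shows how each scheme maps the list of p objective coefficient vectors of a MOBLP to the objectives of the reduced problem; the function names and the choice of z_1 as the removed objective are assumptions made for the example.

    import numpy as np

    def weighted_sum_projection(objs, eps=1.0):
        # Remove z_1 after folding it, with strictly positive weight eps,
        # into each of the other objectives: p-1 objectives remain.
        z1, rest = objs[0], objs[1:]
        return [z + eps * z1 for z in rest]

    def lexicographic_projection(objs):
        # Simply remove z_1: it becomes the primary objective, and the
        # remaining p-1 objectives are optimized as secondary objectives.
        return objs[0], objs[1:]

    # Example with p = 3 objectives over n = 2 variables:
    objs = [np.array([3.0, 1.0]), np.array([2.0, 5.0]), np.array([4.0, 2.0])]
    reduced = weighted_sum_projection(objs)   # two objectives: z_i + z_1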

In light of the above, which objective function should be selected for the projection, and how the projection should be carried out, are two typical questions that arise when developing a criterion space search algorithm. By this observation, there are many possible ways to develop a criterion space search algorithm, some of which may perform better on some instances. So, the underlying research question of this study is the following: can Machine Learning (ML) techniques help us answer these questions for a given class of instances of a multi-objective optimization problem?

It is worth mentioning that, in recent years, similar questions have been asked in the field of single-objective optimization. For example, ML techniques have been successfully implemented for the purpose of variable selection and node selection in branch-and-bound algorithms (see, for instance, Khalil et al. (2016), Alvarez et al. (2017), Sabharwal et al. (2012), He et al. (2014), Khalil et al. (2017)). However, the majority of the algorithmic/theoretical studies in the field of ML have still focused on using optimization models and algorithms to enhance ML techniques, and not the other way around (see, for instance, Roth and Yih (2005), Bottou (2010), Le et al. (2011), Sra et al. (2012), Snoek et al. (2012), Bertsimas et al. (2016)). In general, to the best of our knowledge, there are no studies in the literature that address the problem of enhancing multi-objective optimization algorithms using ML. In this study, as a first attempt, we focus only on the simplest and most high-level question that can be asked: for a given instance of a MOBLP with p objective functions, which objective should be removed to reduce the dimension of the criterion space to p-1 in order to minimize the solution time? It is evident that if one can show that ML is valuable even for such a high-level question, then deeper questions can be asked and explored that may improve the solution time significantly.

In order to answer the above question, we employ one of the effective state-of-the-art algorithms in the literature of multi-objective optimization, the so-called KSA, which was developed by Kirlik and Sayın (2014). This algorithm uses the lexicographical projection for reducing the p-dimensional criterion space to p-1 dimensions, and then recursively reduces the dimension from p-1 to 1 by using a special case of the weighted sum projection in which all the weights are equal to one. Currently, the default objective function for conducting the projection from the p-dimensional criterion space to p-1 dimensions is the first objective function (effectively a random choice, because one can change the order of the objective functions in an input file). So, a natural question is: does it really matter which objective function is selected for such a projection? To answer this question, we conducted a set of experiments using the C++ implementation of the KSA, which is publicly available at http://home.ku.edu.tr/~moolibrary, and recorded the number of single-objective integer linear programs solved (#ILPs) and the computational time (in seconds). We generated 1000 instances (200 per class) of the tri-objective Assignment Problem (AP) and 1000 instances (200 per class) of the tri-objective Knapsack Problem (KP) based on the procedure described by Kirlik and Sayın (2014).

Table 1 shows the impact of projecting based on the worst and the best objective function when using the KSA, where #ILPs is the number of single-objective integer linear programs solved. The numbers reported in this table are averages over 200 instances.

Table 1: Projecting based on different objectives using the KSA.

                                Projecting worst objective    Projecting best objective          %Decrease
Type  #Objectives  #Variables   Run time (s.)     #ILPs       Run time (s.)     #ILPs      Run time    #ILPs
AP    3            20×20           351.56       5,122.67         337.03       5,015.39       4.31%     4.61%
      3            25×25           948.21       9,685.69         912.07       9,500.52       3.96%     4.40%
      3            30×30         2,064.34      16,294.37       1,988.64      16,019.64       3.81%     4.01%
      3            35×35         4,212.69      26,161.34       4,050.44      25,592.13       4.01%     4.27%
      3            40×40         6,888.45      35,737.21       6,636.79      35,061.41       3.79%     3.97%
KP    3            60              270.34       3,883.45         212.41       3,861.68      27.27%     1.05%
      3            70              813.31       6,182.04         638.87       6,158.20      27.31%     0.82%
      3            80            1,740.80       9,297.96       1,375.58       9,265.09      26.55%     0.74%
      3            90            5,109.56      14,257.32       3,917.32      14,212.35      30.44%     0.63%
      3            100          10,451.97      19,420.06       7,780.96      19,366.88      34.33%     0.57%

We observe that, on average, the running time can be reduced by up to 34% while the #ILPs can be improved by up to 4%. This numerical study clearly shows the importance of the projection choice for the solution time. Hence, it is certainly worth studying ML techniques for predicting the best objective function for projecting, which we call learning to project. So, our main contribution in this study is to introduce an ML framework to simulate the selection of the best objective function to project. We collect data from each objective function and its interactions with the decision space to create features. Based on the created features, an easy-to-evaluate function is learned to emulate the classification of the projections. Another contribution of this study is the development of a simple but effective bi-objective optimization-based heuristic approach for selecting the best subset of features to overcome the issue of overfitting. We show that the accuracy of the proposed prediction model can reach around 72%, which represents up to a 12% improvement in solution time.

The rest of this paper is organized as follows. In Section 2, some useful concepts and notation of multi-objective optimization are introduced, and a high-level description of the KSA is given. In Section 3, we provide a high-level description of our proposed machine learning framework and its three main components.

In Section 4, the first component of the framework, which is a pre-ordering approach for changing the order of the objective functions in an input file, is explained. In Section 5, the second component of the framework, which includes the features and the labels, is explained. In Section 6, the third and last component of the framework, which is a bi-objective optimization-based heuristic for selecting the best subset of features, is introduced. In Section 7, we provide a comprehensive computational study. Finally, in Section 8, we provide some concluding remarks.

2. Preliminaries

A Multi-Objective Binary Linear Program (MOBLP) is a problem of the form (1) in which X := { x ∈ {0,1}^n : Ax ≤ b } represents the feasible set in the decision space, A ∈ R^{m×n}, and b ∈ R^m. It is assumed that X is bounded and that z_i(x) = c_i^T x, where c_i ∈ R^n for i = 1, 2, ..., p, represents a linear objective function. The image Y of X under the vector-valued function z := (z_1, z_2, ..., z_p) represents the feasible set in the objective/criterion space, that is, Y := { o ∈ R^p : o = z(x) for some x ∈ X }. Throughout this article, vectors are always column vectors and are denoted in bold font.

Definition 1. A feasible solution x ∈ X is called efficient or Pareto optimal if there is no other x′ ∈ X such that z_i(x′) ≤ z_i(x) for i = 1, ..., p and z(x′) ≠ z(x). If x is efficient, then z(x) is called a nondominated point. The set of all efficient solutions x ∈ X is denoted by X_E. The set of all nondominated points z(x) ∈ Y for some x ∈ X_E is denoted by Y_N and referred to as the nondominated frontier.
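As a small aside, Definition 1 translates directly into code; the following sketch (ours, not part of the paper) checks Pareto dominance between two criterion-space points stored as NumPy arrays.

    import numpy as np

    def dominates(z_a, z_b):
        # For minimization: z_a dominates z_b if it is no worse in every
        # objective and strictly better in at least one.
        return bool(np.all(z_a <= z_b) and np.any(z_a < z_b))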

Overall, multi-objective optimization is concerned with finding all nondominated points, i.e., an exact representation of the elements of Y_N. The set of nondominated points of a MOBLP is finite (since, by assumption, X is bounded). However, due to the existence of unsupported nondominated points, i.e., nondominated points that cannot be obtained by optimizing any positive weighted summation of the objective functions over the feasible set, computing all nondominated points is challenging. One of the effective criterion space search algorithms for MOBLPs is the KSA, and a high-level description of it is provided next.

The KSA is basically a variation of the ε-constraint method for generating the entire nondominated frontier of multi-objective integer linear programs. In each iteration, this algorithm solves a lexicographical optimization problem whose first stage is

    x̂ ∈ arg min { z_1(x) : x ∈ X, z_i(x) ≤ u_i for all i ∈ {2, ..., p} },

where u_2, ..., u_p are user-defined upper bounds. If x̂ exists, i.e., the first stage is feasible, then the following second-stage problem is solved:

    x̂′ ∈ arg min { Σ_{i=2}^p z_i(x) : x ∈ X, z_1(x) ≤ z_1(x̂), z_i(x) ≤ u_i for all i ∈ {2, ..., p} }.

The algorithm computes all nondominated points by imposing different values on u_2, ..., u_p in each iteration. Interested readers may refer to Kirlik and Sayın (2014) for further details on how the values of u_2, ..., u_p are updated in each iteration. It is important to note that in the first stage users can replace the objective function z_1(x) with any other objective function, i.e., z_j(x) where j ∈ {1, ..., p}, and change the objective function of the second stage accordingly, i.e., to Σ_{i=1, i≠j}^p z_i(x). As shown in the Introduction, on average, the running time can decrease by up to 34% by choosing the right objective function for the first stage. So, the goal of the proposed machine learning technique in this study is to identify the best choice.

As an aside, we note that, to be consistent with our explanation of the lexicographic and weighted sum projections in the Introduction, the lexicographic optimization problem of the KSA is presented slightly differently in this section. Specifically, Kirlik and Sayın (2014) use the following optimization problem instead of the second-stage problem mentioned above:

    x̂′ ∈ arg min { Σ_{i=1}^p z_i(x) : x ∈ X, z_1(x) = z_1(x̂), z_i(x) ≤ u_i for all i ∈ {2, ..., p} }.

However, one can easily observe that these two formulations are equivalent. In other words, the lexicographic optimization problem introduced in this section is just a different representation of the one proposed by Kirlik and Sayın (2014).
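To make the two-stage structure concrete, here is a minimal sketch of one such lexicographic iteration, written with SciPy's mixed-integer solver (scipy.optimize.milp, available in SciPy 1.9+). This is our illustration under stated assumptions, not the authors' C++ implementation: C is the p×n matrix of objective coefficients, (A, b) defines X, u[i] holds the upper bound u_i, and j is the projected (first-stage) objective index.

    import numpy as np
    from scipy.optimize import milp, LinearConstraint, Bounds

    def ksa_lex_iteration(C, A, b, u, j=0):
        p, n = C.shape
        others = [i for i in range(p) if i != j]
        integrality = np.ones(n)                     # all variables binary
        bounds = Bounds(0, 1)
        base = [LinearConstraint(A, ub=b)]           # Ax <= b
        base += [LinearConstraint(C[[i]], ub=u[i])   # z_i(x) <= u_i, i != j
                 for i in others]
        # Stage 1: minimize the projected objective z_j
        s1 = milp(c=C[j], integrality=integrality, bounds=bounds,
                  constraints=base)
        if not s1.success:                           # first stage infeasible
            return None
        z_hat = float(C[j] @ np.round(s1.x))
        # Stage 2: minimize the sum of the other objectives over the
        # z_j-optimal solutions, i.e., subject to z_j(x) <= z_j(x_hat)
        s2 = milp(c=C[others].sum(axis=0), integrality=integrality,
                  bounds=bounds,
                  constraints=base + [LinearConstraint(C[[j]], ub=z_hat)])
        return np.round(s2.x)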

3. Machine learning framework

We now introduce our ML framework for learning to project in MOBLPs. Our proposed framework is based on the Multi-class Support Vector Machine (MSVM). In this application, MSVM learns a function f : Φ → Ω from a training set to predict which objective function will have the best performance in the first stage of the KSA (for a MOBLP instance), where Φ is the feature map domain describing the MOBLP instance and Ω := {1, 2, ..., p} is the domain of the labels. A label y ∈ Ω indicates the index of the objective function that should be selected. We do not explain MSVM in this study, but interested readers may refer to Crammer and Singer (2001) and Tsochantaridis et al. (2004) for details. We used the publicly available implementation of MSVM, which can be found at https://goo.gl/4hljyq. It is worth mentioning that we used MSVM mainly because it performed well during the course of this study. In Section 7.4, we also report results obtained by replacing MSVM with Random Forest (Breiman 2001, Prinzie and Van den Poel 2008) to show the performance of another learning technique in our proposed framework; further reasons why MSVM is used in this study are also given there. Overall, the proposed ML framework contains three main components:

Component 1: It is evident that by changing the order of the objective functions of a MOBLP instance in an input file, the instance remains the same. Therefore, in order to increase the stability of the prediction of MSVM, we propose an approach to pre-order the objective functions of each MOBLP instance in an input file before feeding it to MSVM (see Section 4).

Component 2: We propose several generic features that can be used to describe each MOBLP instance. A high-level description of the features can be found in Section 5, and their detailed descriptions can be found in Appendix A.

Component 3: We propose a bi-objective heuristic approach (see Section 6) for selecting the best subset of features for each class of MOBLP instances (which are AP and KP in this study). Our numerical results show that, in practice, our approach selects around 15% of the features based on the training set for each class of MOBLP instances. Note that identifying the best subset of features helps overcome the issue of overfitting and improves the prediction accuracy (Charkhgard and Eshragh 2019, Tibshirani 1996).

The proposed ML framework uses the above components for training purposes. A detailed discussion of the accuracy of the proposed framework on a testing set (for each class of MOBLP instances) is given in Section 7.
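As a hedged sketch of the learning step (ours, under the assumption that scikit-learn's Crammer–Singer formulation is an acceptable stand-in for the MSVM implementation linked above), training and prediction could look as follows, with X_train holding one feature row per MOBLP instance and y_train the index of the best objective:

    from sklearn.svm import LinearSVC

    # Crammer-Singer multi-class SVM; the termination settings here are
    # placeholders, not the authors' exact configuration.
    clf = LinearSVC(multi_class="crammer_singer", tol=0.1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)   # predicted objective index per instance
    # clf.coef_ is a p x k weight matrix, playing the role of W_t in Sec. 6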

4. A pre-ordering approach for objective functions

It is obvious that changing the order of the objective functions in the input file corresponding to an instance does not generate a new instance. In other words, the instance is merely represented differently, and hence its nondominated frontier remains the same. This suggests that the vector of features extracted for any instance should be independent of the order of the objective functions. To address this issue, we propose to perform a pre-ordering (heuristic) approach before giving an instance to MSVM for training or testing purposes. That is, when users provide an instance, we first change its input file by re-ordering the objective functions before feeding it to the MSVM. This stabilizes the prediction accuracy of the proposed ML framework.

In light of the above, let

    x̄ := ( 1/(Σ_{i=1}^p c_{i1} + 1), ..., 1/(Σ_{i=1}^p c_{in} + 1) ).

In the proposed approach, we re-order the objective functions in an input file in non-decreasing order of c_1^T x̄, c_2^T x̄, ..., c_p^T x̄. Intuitively, c_i^T x̄ can be viewed as the normalization score of objective function i ∈ {1, ..., p}. In the rest of this paper, the vectors c_i for i = 1, 2, ..., p are assumed to be ordered according to the proposed approach.
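A minimal sketch of this pre-ordering (our reading of the formula above, with C the p×n matrix whose rows are c_1, ..., c_p):

    import numpy as np

    def preorder_objectives(C):
        # x_bar_j = 1 / (sum_i c_ij + 1); scores[i] = c_i' x_bar
        x_bar = 1.0 / (C.sum(axis=0) + 1.0)
        scores = C @ x_bar
        order = np.argsort(scores, kind="stable")  # non-decreasing scores
        return C[order], order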

5. Features and label describing a MOBLP instance

This section provides a high-level explanation of the features that we create to describe a MOBLP instance, and of how each instance is labeled. To the best of our knowledge, there are no studies that introduce features to describe multi-objective instances, and hence the proposed features are new.

5.1. Features

The efficiency of our proposed ML approach relies on the features describing a MOBLP instance. In other words, the features should be easy to compute and effective. Based on this observation, we create only static features, i.e., those that are computed once, using just the information provided by the MOBLP instance. Note that we only consider static features because the learning process and the decision on which objective function to select for the projection (in the KSA) have to take place before solving the MOBLP instance. Due to the nature of our research, most of our features describe the objective functions of the instances. We understand that the objective functions by themselves are a limited source of information for describing an instance. Therefore, we also establish relationships between the objective functions and the other characteristics of the instance in order to improve the reliability of our features.

In light of the above, a total of 5p^2 + 106p - 50 features are introduced for describing each instance of a MOBLP. As an aside, because p = 3 in our computational study, we have 313 features in total. Some of these features rely on the characteristics of the objective functions, such as the magnitude and the number of positive, negative, and zero coefficients. We also consider features that establish a relationship between the objective functions using some normalization techniques, e.g., the pre-ordering approach used to order the objective functions (see Section 4). Other features are created based on mathematical and statistical computations that link the objective functions with the technological coefficients and the right-hand-side values. We also define features based on the area of the projected criterion space, i.e., the corresponding (p-1)-dimensional criterion space, that needs to be explored when one of the objectives is selected for conducting the projection. Note that, to compute such an area exactly, several single-objective binary linear programming problems would need to be solved. Hence, in order to reduce the complexity of the feature extraction, we compute an approximation of the area to explore by optimizing the linear relaxations of these problems. Additionally, we create features that describe the basic characteristics of an instance, e.g., its size, number of variables, and number of constraints.

The main idea of the features is to generate as much information as possible in a simple way. We accomplish this by computing all the proposed features in polynomial time for a given instance. The features are normalized using a t-statistic score. Normalization is performed by aggregating subsets of features computed from a similar source. Finally, the values of the normalized feature matrix lie approximately between -1 and 1. Interested readers can find a detailed explanation of the features in Appendix A.

5.2. Labels

Based on our research goal, i.e., simulating the selection of the best objective, we propose a multi-class integer labeling scheme in which each instance receives a label y ∈ Ω, where Ω = {1, 2, ..., p} is the domain of the labels. The value of y classifies the instance with a label that indicates the index of the best objective function for the projection, based on the running time of the KSA (when generating the entire nondominated frontier of the instance). The label of each instance is assigned as follows:

    y ∈ arg min_{j ∈ {1, ..., p}} { RunningTime_j },    (2)

where RunningTime_j is the running time on the instance when objective function j is used for projecting the criterion space.
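A sketch of this labeling step (ours; ksa_running_time is a hypothetical wrapper that runs the KSA on an instance with a chosen projected objective and returns the elapsed time):

    import numpy as np

    def label_instance(instance, p, ksa_running_time):
        # Equation (2): the label is the objective index with the smallest
        # KSA running time (0-based here rather than 1-based).
        times = [ksa_running_time(instance, j) for j in range(p)]
        return int(np.argmin(times))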

6. Best subset selection of features

It is easy to observe that by introducing more (linearly independent) features and retraining an ML model (to optimality), its prediction error on the training set, i.e., error = 1 - accuracy, decreases and eventually becomes zero. This is because we are providing a larger degree of freedom to the ML model. However, this is not necessarily the case for the testing set. In other words, by introducing more features, the resulting ML model is often overfitted to the training set and does not perform well on the testing set. So, the underlying idea of the best subset selection of features is to avoid the issue of overfitting. The key point, however, is that in a real-world scenario we do not have access to the testing set, so selecting the best subset of features must be done based on the information obtained from the training set.

In light of the above, studying the trade-off between the number of features and the prediction error of an ML model on the training set is helpful for selecting the best subset of features (Charkhgard and Eshragh 2019). However, computing such a trade-off using exact methods is difficult in practice, since the total number of subsets (of features) is an exponential function of the number of features. Therefore, in this section, we introduce a bi-objective optimization-based heuristic for selecting the best subset of features. The proposed approach has two phases:

Phase I: In the first phase, the algorithm attempts to approximate the trade-off. Specifically, it computes an approximated nondominated frontier of a bi-objective optimization problem whose conflicting objectives are minimizing the number of features and minimizing the prediction error on the training set.

Phase II: In the second phase, the algorithm selects one of the approximated nondominated points, and its corresponding MSVM model is used for prediction on the testing set.

We first explain Phase I. To compute the approximated nondominated frontier, we run MSVM iteratively on the training set. In each iteration, one approximated nondominated point is generated. The approximated nondominated point obtained in iteration t is denoted by (k_t, e_t), where k_t is the number of features in the corresponding prediction model (obtained by MSVM) and e_t is the prediction error of the corresponding model on the training set.

To compute the first nondominated point, the proposed algorithm assumes that all features are available and runs the MSVM to obtain the parameters of the prediction model. We denote by W_t the parameter matrix of the prediction model obtained by MSVM in iteration t. Note that W_t is a p × k_t matrix, where p is the number of objectives. Now consider an arbitrary iteration t. The algorithm explores the parameters of the prediction model obtained by MSVM in the previous iteration, i.e., W_{t-1}, and removes the least important feature based on W_{t-1}. Hence, because one feature is removed, we have k_t = k_{t-1} - 1. Specifically, each column of the matrix W_{t-1} is associated with a feature, so the algorithm computes the standard deviation of each column independently; the feature with the minimum standard deviation is selected and removed in iteration t. Note that MSVM creates a model for each objective function, which is why the matrix W_{t-1} has p rows. So, if the standard deviation of a column of the matrix is zero, then the corresponding feature contributes exactly the same amount to all p models and can therefore be removed. We thus observe that the standard deviation plays an important role in identifying the least important feature. Overall, after removing the least important feature, MSVM is run again to compute W_t and e_t. The algorithm terminates as soon as k_t = 0. A detailed description of the algorithm for computing the approximated nondominated frontier can be found in Algorithm 1.

In the second phase, we select an approximated nondominated point. Before doing so, it is worth mentioning that MSVM can take a long time to compute W_t in each iteration of Algorithm 1. To avoid this issue, users usually terminate MSVM before it reaches an optimal solution by imposing some termination conditions, including a relative optimality gap tolerance, and by adjusting the so-called regularization parameter (see Crammer and Singer (2001) and Tsochantaridis et al. (2004) for details). In this study, we set the tolerance to 0.1 and the regularization parameter to 5 × 10^4 (since we numerically observed that MSVM performs better in this case). Such limitations obviously impact the prediction error obtained on the training set, i.e., e_t. Consequently, some of the points reported by Algorithm 1 may dominate each other. Therefore, in Phase II, we first remove the dominated points.

Algorithm 1: Phase I: Computing an approximated nondominated frontier

    Input: the training set and the set of features
    1   Queue.create(Q)
    2   t ← 1
    3   k_t ← the initial number of features
    4   while k_t ≠ 0 do
    5       if t ≠ 1 then
    6           Find the least important feature from W_{t-1} and remove it from the set of features
    7           k_t ← k_{t-1} - 1
    8       Compute W_t by applying MSVM on the training set using the current set of features
    9       Compute e_t by applying the obtained prediction model associated with W_t on the training set
    10      Q.add((k_t, e_t))
    11      t ← t + 1
    12  return Q

In the remainder of this section, we assume that no dominated point remains in the approximated nondominated frontier. Next, the proposed approach selects the approximated nondominated point that has the minimum Euclidean distance from the (imaginary) ideal point, i.e., an imaginary point in the criterion space that has both the minimum number of features and the minimum prediction error. This technique is a special case of optimization over the frontier (Abbas and Chaabane 2006, Jorge 2009, Boland et al. 2017a, Sierra-Altamiranda and Charkhgard 2019). We note that in bi-objective optimization the ideal point can be computed easily from the endpoints of the (approximated) nondominated frontier. Let (k′, e′) and (k″, e″) be the two endpoints, where k′ < k″ and e′ > e″. In this case, the ideal point is (k′, e″). Note also that, because the first and second objectives have different scales, in this study we first normalize all approximated nondominated points before selecting one. Let (k, e) be an arbitrary approximated nondominated point. After normalization, this point becomes

    ( (k - k′) / (k″ - k′), (e - e″) / (e′ - e″) ).

Observe that the proposed normalization ensures that the value of each component of a point lies between 0 and 1. As a consequence, the normalized ideal point is always (0, 0). We will discuss the effectiveness of our proposed best subset selection approach in the next section.
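The following sketch (ours) condenses Algorithm 1 and the Phase II choice rule. Here train_msvm is a hypothetical routine returning the p × k weight matrix W and the training error e of the fitted model; the sketch stops at one remaining feature, and the dominated-point filtering of Phase II is omitted for brevity.

    import numpy as np

    def best_subset(X, y, train_msvm):
        features = list(range(X.shape[1]))
        frontier = []                              # points (k_t, e_t)
        W, e = train_msvm(X[:, features], y)       # start with all features
        frontier.append((len(features), e, list(features)))
        while len(features) > 1:
            # least important feature: column of W with the smallest std
            drop = int(np.argmin(W.std(axis=0)))
            del features[drop]
            W, e = train_msvm(X[:, features], y)
            frontier.append((len(features), e, list(features)))
        # Phase II: normalize both coordinates and select the point
        # closest (Euclidean distance) to the normalized ideal point (0, 0)
        ks = np.array([k for k, _, _ in frontier], dtype=float)
        es = np.array([e for _, e, _ in frontier], dtype=float)
        kn = (ks - ks.min()) / (ks.max() - ks.min())
        en = (es - es.min()) / (es.max() - es.min())
        return frontier[int(np.argmin(np.hypot(kn, en)))]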

7. A computational study

In this section, we conduct an extensive computational study to evaluate the performance of the KSA when the proposed ML technique is used for learning to project. We generate 1000 tri-objective AP instances and 1000 tri-objective KP instances based on the procedures described by Kirlik and Sayın (2014). Since there are three objectives, we compute the entire nondominated frontier three times for each instance using the KSA, each time selecting a different objective function for the projection. We employ CPLEX 12.7 as the single-objective binary linear programming solver. All computational experiments are carried out on a Dell PowerEdge R630 with two Intel Xeon E5-2650 2.2 GHz 12-Core Processors (30MB), 128GB RAM, and the RedHat Enterprise Linux 6.8 operating system, using a single thread for all experiments.

Our experiments are divided into three parts. In the first part, we run our approach over the entire set of instances, using 80% of the data as the training set and 20% of the data as the testing set. The second part evaluates the prediction models obtained in the first part on a reduced testing set; that is, the training set is the same as in the first part, but the testing set is smaller. Specifically, if, for an instance in the testing set of the first part, it does not really matter (in terms of solution time) which objective function is selected for the projection, then we remove that instance; one can think of such instances as tie cases. In the third part of our experiments, we extend the concept of the reduced testing set to the training set; that is, we remove the tie cases not only from the testing set but also from the training set. In general, the goal of reducing the testing set and/or the training set is to improve the overall accuracy of the prediction model. At the end of the computational study, we replace MSVM by Random Forest in the proposed ML framework to show the performance of another learning technique. We note that in this computational study we do not report any time for our proposed ML framework, because the aggregate time for generating the features, the learning process, and the predictions over all 1000 instances of a class of optimization problem, i.e., AP or KP, is around 50 seconds; this implies that, on average, almost 0.05 seconds are spent on each instance.

7.1. Complete training and testing sets

The first part of our experiments is done on the complete training and testing sets. For each class of optimization problems, i.e., KP and AP, the proposed subset selection approach is run on its corresponding training set and obtains the best subset of features and its corresponding prediction model for each class of instances.

Before providing detailed explanations of the accuracy of this prediction model on the testing set, it is necessary to show that the proposed bi-objective optimization approach for selecting the best subset of features is indeed effective. Figure 1 shows the approximated nondominated frontier (obtained during the course of the proposed best subset selection approach) for each class of optimization problems. In this figure, small (red) plus symbols are the outputs of Algorithm 1, the (black) square on the vertical axis shows the ideal point, and the (yellow) square on the approximated nondominated frontier shows the point selected by the proposed method. First note that we introduced 313 generic features in this paper, but the tail of the approximated nondominated frontier in Figure 1 clearly shows that not all 313 features are used. This is because some of the 313 features are not applicable to all classes and are removed automatically before running the proposed best subset selection approach.

We observe from Figure 1 that, overall, introducing more features decreases the prediction error on the training set. When all features are included, the accuracy, i.e., 1 - error, on the training set is around 59.5% and 70% for AP and KP instances, respectively. This is not surprising, because the learning procedure is carried out on the training set, and introducing more features gives a larger degree of freedom to the learning model. However, this is not necessarily the case for the testing set. Basically, by introducing more features, we may raise the issue of overfitting, i.e., the prediction error is small on the training set but large on the testing set.

Figure 1: An illustration of the performance of the proposed approach for selecting the best subset of features on the complete training set. (a) Training set of AP; (b) Training set of KP. [Plots of error versus number of features.]

To show this, for each of the points (other than the ideal point) in Figure 1, we have plotted its corresponding point for the testing set in Figure 2. Specifically, for each point in Figure 1, we run its corresponding model on the testing set to compute the error. From Figure 2, we observe that the error fluctuates considerably. For AP instances, the prediction model with around 40 features is the best prediction model; similarly, for KP instances, the prediction model with around 25 features is the best. Note that in practice we do not have access to the testing set, so we must select the best subset of features based only on the training set. Therefore, the goal of any best subset selection technique is to identify the prediction model that is (almost) optimal for the testing set based on the existing data, i.e., the training set. From Figure 2, we observe that our proposed best subset selection heuristic has this desirable characteristic in practice: the selected model, i.e., the (yellow) square, is nearly optimal. In fact, the proposed approach selected a prediction model with an accuracy of around 50% and 55% for AP and KP instances, respectively. This implies that the absolute difference between the accuracy of the model selected by the proposed subset selection approach and the accuracy of the optimal model is almost 3% and 5% for AP and KP instances, respectively. We note that for both classes of instances, fewer than 50 features appear in the model selected by the proposed approach. Overall, these results are promising, given that the (expected) probability of randomly picking the correct objective function to project is 1/p, i.e., around 33.3% for the tri-objective instances.

Table 2: Accuracy and average time decrease on the testing set when using the proposed ML technique (for the case of the complete training and testing sets).

Type  Vars    Accuracy   ML vs. Rand   Best vs. Rand   (ML vs. Rand)/(Best vs. Rand)
AP    20×20    55.56%       1.29%          2.33%              55.43%
      25×25    44.74%       0.67%          1.65%              40.57%
      30×30    41.30%       0.18%          1.48%              12.20%
      35×35    52.08%       0.44%          1.69%              25.81%
      40×40    56.25%       1.07%          1.65%              64.99%
      Avg      49.50%       0.68%          1.74%              38.90%
KP    60       63.64%       9.12%         11.03%              82.68%
      70       45.65%       2.01%         10.14%              19.79%
      80       56.10%       4.59%         11.42%              40.22%
      90       58.82%       8.05%         13.27%              60.68%
      100      51.43%       5.09%         10.34%              49.24%
      Avg      55.00%       5.67%         11.17%              50.77%

Figure 2: An illustration of the performance of the proposed approach for the best subset selection of features on the complete testing set. (a) Testing set of AP; (b) Testing set of KP. [Plots of error versus number of features.]

We now discuss the performance of the selected model in detail for each class of optimization problems; Table 2 summarizes our findings. In this table, the column labeled "Accuracy" shows the average percentage prediction accuracy of the selected model for the different subclasses of instances (as mentioned in the Introduction, each subclass has 200 instances). The column labeled "ML vs. Rand" shows the average percentage decrease in solution time when the ML technique is used, compared to randomly picking an objective function for the projection. The column labeled "Best vs. Rand" shows the average percentage decrease in solution time when the best objective function is selected for the projection, compared to randomly picking an objective function. Finally, the column labeled "(ML vs. Rand)/(Best vs. Rand)" shows the ratio of "ML vs. Rand" to "Best vs. Rand", as a percentage.

Overall, we observe that our ML method improves the computational time on all testing sets. For AP instances, the improvement is around 0.68% on average, which is small. However, in the ideal case we could obtain only around a 1.74% average time improvement for these instances, so the improvement obtained with the proposed ML technique is 38.9% of the ideal case; for the largest subclass of AP instances, this number is around 64.99%. For the KP instances, the results are even more promising, since the improvement in solution time is around 5.67% on average. In the ideal case, we could obtain an average improvement of 11.17% for these instances, so the improvement obtained with the proposed ML technique is 50.77% of the ideal case.

7.2. Complete training set and reduced testing set

In this section, we test the performance of the model obtained in Section 7.1 on a reduced testing set. Specifically, we remove the instances that can be considered tie cases, i.e., those in which the solution time does not change significantly (relative to other instances) when different objective functions are selected for the projection. To reduce the testing set, we apply the following steps (a small sketch of this filter is given below):

Step 1: We compute the time range of each instance, i.e., the difference between the best and worst solution times obtained for the instance when different objective functions are considered for the projection.

Step 2: For each subclass of instances, i.e., those with the same number of decision variables, we compute the standard deviation and the minimum of the time ranges in that subclass.

Step 3: We eliminate an instance, i.e., consider it a tie case, if its time range is not greater than the sum of the minimum and the standard deviation of the time ranges in its associated subclass.

As a result of this procedure, the testing set was reduced by 35.5% for AP instances and by 17.5% for KP instances.
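A minimal sketch of the three-step filter (ours), where times is an (instances × p) array of KSA running times, one column per projected objective, for a single subclass:

    import numpy as np

    def non_tie_indices(times):
        ranges = times.max(axis=1) - times.min(axis=1)   # Step 1
        threshold = ranges.min() + ranges.std()          # Step 2
        return np.where(ranges > threshold)[0]           # Step 3: keep these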

Table 3 summarizes our findings for the reduced testing set.

Table 3: Accuracy and average time decrease on the testing set when using the proposed ML technique (for the case of the complete training set and the reduced testing set).

Type  Vars    Accuracy   ML vs. Rand   Best vs. Rand   (ML vs. Rand)/(Best vs. Rand)
AP    20×20    58.06%       1.48%          2.54%              58.10%
      25×25    60.00%       0.92%          2.23%              41.32%
      30×30    40.74%       0.12%          1.96%               6.02%
      35×35    57.89%       0.54%          1.97%              27.17%
      40×40    69.57%       1.64%          2.15%              76.09%
      Avg      56.59%       0.90%          2.16%              41.73%
KP    60       71.05%      10.11%         11.96%              84.52%
      70       45.95%       2.03%         11.48%              17.67%
      80       57.58%       5.30%         13.48%              39.31%
      90       62.96%      10.21%         15.57%              65.59%
      100      60.00%       6.32%         11.33%              55.77%
      Avg      59.39%       6.66%         12.63%              52.74%

Observe that the accuracy of the prediction models has increased significantly on the reduced testing set; specifically, it has reached around 56.59% and 59.39% on average for AP and KP instances, respectively. Since the eliminated instances are considered tie cases, we can assume that they are also success cases for the prediction model. By counting such success cases, the prediction accuracy increases to 56.59% × (1 - 0.355) + 35.5% ≈ 72% for AP instances and 59.39% × (1 - 0.175) + 17.5% ≈ 66.5% for KP instances. In terms of computational time, we also observe (from Table 3) an improvement of around 0.90% and 6.66% on average for AP and KP instances, respectively. This amount of improvement is about 41.73% and 52.74% of the ideal scenarios (on average) for AP and KP instances, respectively.

7.3. Reduced training and testing sets

Given the promising results obtained in Section 7.2, it is natural to ask whether we can see even more improvement if we reduce not only the testing set but also the training set. Therefore, in this section, we eliminate the tie cases, using the same procedure discussed in Section 7.2, from both the training and testing sets. By doing so, the size of the training+testing set was reduced by 37% and 18% for AP and KP instances, respectively. Due to the change in the training set, we need to apply our proposed approach for the best subset selection of features again. So, similar to Section 7.1, Figure 3 shows the approximated nondominated frontier for each class of optimization problems based on the reduced training set. Comparing the ideal points in Figures 1 and 3, an immediate improvement in the (ideal) accuracy can be observed; in fact, the absolute difference between the errors of the ideal points in these figures is around 12% and 7% for AP and KP instances, respectively. Similar improvements can be observed by comparing the selected approximated nondominated points in Figures 1 and 3.

Similar to Section 7.1, for each of the points (other than the ideal point) in Figure 3, we have plotted its corresponding point for the testing set in Figure 4. We again observe that the selected model, i.e., the (yellow) square, is nearly optimal for both classes of optimization problems. In fact, the proposed approach selected a prediction model with an accuracy of around 52% and 62% for AP and KP instances, respectively. This implies that the absolute difference between the accuracy of the model selected by the proposed approach and the accuracy of the optimal model is almost 5% and 3% for AP and KP instances, respectively.

A summary of the results of this last experiment can be found in Table 4. Observe that the average prediction accuracy on the testing set for the experimental setting of this section, i.e., reduced training and testing sets, has improved significantly for KP instances compared to the results given in Sections 7.1 and 7.2.

Figure 3: An illustration of the performance of the proposed approach for selecting the best subset of features on the reduced training set. (a) Training set of AP; (b) Training set of KP. [Plots of error versus number of features.]

Figure 4: An illustration of the performance of the proposed approach for the best subset selection of features on the reduced testing set. (a) Testing set of AP; (b) Testing set of KP. [Plots of error versus number of features.]

However, for the AP instances, the average prediction accuracy in this section is only better than the one presented in Section 7.1. Overall, the average prediction accuracy is 52.38% and 61.59% for AP and KP instances when using the reduced training and testing sets. By considering the tie cases as success events, the projected accuracy increases to up to 70% and 68.5% for AP and KP instances, respectively. The importance of this increase in accuracy is highlighted by the time decrease percentages given in Table 4, which are over 1% for AP instances and near 8% for KP instances. In fact, for the largest subclass of AP instances, the average time improvement of 1.6% is equivalent to almost 110 seconds on average. Similarly, for the largest subclass of KP instances, the time improvement of 4.7% is around 490 seconds on average.

Table 4: Accuracy and average time decrease on the testing set when using the proposed ML technique (for the case of the reduced training and testing sets).

Type  Vars    Accuracy   ML vs. Rand   Best vs. Rand   (ML vs. Rand)/(Best vs. Rand)
AP    20×20    58.33%       0.82%          2.14%              38.41%
      25×25    52.38%       1.17%          2.39%              48.76%
      30×30    55.17%       0.67%          1.93%              34.46%
      35×35    45.45%       1.10%          2.53%              43.35%
      40×40    52.63%       1.60%          2.87%              55.61%
      Avg      52.38%       1.03%          2.35%              43.99%
KP    60       60.61%       6.75%         12.16%              55.51%
      70       55.56%       6.79%         12.03%              56.43%
      80       60.00%      12.14%         16.91%              71.79%
      90       72.73%       9.22%         12.26%              75.17%
      100      62.50%       4.70%         10.56%              44.50%
      Avg      61.59%       7.61%         12.64%              60.18%

7.4. Replacing MSVM by Random Forest

One main reason that we used MSVM in this study is that (as shown in the previous sections) it performs well in practice for the purpose of this study. Another critical reason, however, is the fact that MSVM creates a matrix of parameters, denoted by W_t, in each iteration. This matrix has p rows, where p is the number of objective functions; in other words, for each objective function, MSVM creates a specific model for predicting which one should be used for the projection in the KSA. This characteristic is desirable because it allowed us to develop a custom-built bi-objective heuristic for selecting the best subset of features. Specifically, as discussed in Section 6, this characteristic is essential for identifying the least important feature in each iteration of Algorithm 1. Applying such a procedure to other ML techniques, however, is not trivial.

In light of the above, in this section we replace MSVM by Random Forest within the proposed machine learning framework. However, we simply use the best subset of features selected by MSVM and then feed it to Random Forest for training and predicting. To implement Random Forest, we use the scikit-learn library in Python (Pedregosa et al. 2011). Table 5 shows a comparison between the prediction accuracy of MSVM and Random Forest under the three experimental settings described in Sections 7.1-7.3; that is, Setting 1 corresponds to the complete training and testing sets, Setting 2 corresponds to the complete training set and the reduced testing set, and Setting 3 refers to the reduced training and testing sets.
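A hedged sketch of this swap (ours; the variable names and the use of default Random Forest hyperparameters are assumptions, and selected is the feature subset chosen by the MSVM-based procedure of Section 6):

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(random_state=0)    # default hyperparameters assumed
    rf.fit(X_train[:, selected], y_train)          # train on the MSVM-chosen subset
    rf_accuracy = rf.score(X_test[:, selected], y_test)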

In this table, the columns labeled "Increase" show the average percentage increase in the prediction accuracy of Random Forest compared to MSVM. Observe that the reported numbers are mostly negative, which implies that, in general, MSVM outperforms Random Forest in terms of prediction accuracy. For example, in Setting 3, the accuracy of Random Forest is around 18.63% and 22.88% worse than the accuracy of MSVM on average. This experiment clearly shows the advantage of using MSVM in the proposed ML framework.

Table 5: A performance comparison between MSVM and Random Forest on the testing set.

                     Setting 1               Setting 2               Setting 3
Type  Vars      Accuracy  Increase      Accuracy  Increase      Accuracy  Increase
AP    20×20      61.11%     9.99%        64.52%    11.12%        36.36%   -37.66%
      25×25      42.11%    -5.89%        50.00%   -16.67%        26.67%   -49.09%
      30×30      45.65%    10.54%        37.04%    -9.09%        52.63%    -4.60%
      35×35      62.50%    20.01%        65.79%    13.65%        70.59%    55.31%
      40×40      50.00%   -11.11%        73.91%     6.24%        50.00%    -5.00%
      Avg        52.50%     6.06%        59.69%     5.48%        42.62%   -18.63%
KP    60         47.73%   -25.00%        47.37%   -33.33%        41.18%   -32.06%
      70         50.00%     9.53%        56.76%    23.52%        50.00%   -10.01%
      80         48.78%   -13.06%        54.55%    -5.27%        61.90%     3.17%
      90         44.12%   -25.00%        55.56%   -11.76%        47.06%   -35.30%
      100        37.14%   -27.78%        43.33%   -27.78%        45.95%   -26.49%
      Avg        46.00%   -16.36%        51.52%   -13.26%        47.50%   -22.88%

8. Conclusions and future research

We presented a multi-class support vector machine based approach to enhance exact multi-objective binary linear programming algorithms. Our approach simulates the selection of the best objective function to be used for the projection in the KSA in order to improve its computational time. We introduced a pre-ordering approach for the objective functions in the input file for the purpose of standardizing the vector of features. Moreover, we introduced a bi-objective optimization approach for selecting the best subset of features in order to overcome overfitting. By conducting an extensive computational study, we showed that reaching a prediction accuracy of around 70% is possible for instances of the tri-objective AP and KP, and that such a prediction accuracy results in a decrease of over 12% in the computational time for some instances.

Overall, we hope that the simplicity of our proposed ML technique and its promising results encourage more researchers to use ML techniques for improving multi-objective optimization solvers. Note that, in this paper, we studied the problem of learning to project in a static setting, i.e., before solving an instance we predict the best objective function and use it during the course of the search. So, one future research direction of this study would be finding a way to employ the proposed learning-to-project technique in a dynamic setting, i.e., at each iteration of the search process we predict the best projected objective and use it. Evidently, this may result in the development of new algorithms that have not yet been studied in the literature of multi-objective optimization.

References

Abbas M, Chaabane D (2006) Optimizing a linear function over an integer efficient set. European Journal of Operational Research 174(2):1140-1161.

Alvarez AM, Louveaux Q, Wehenkel L (2017) A machine learning-based approximation of strong branching. INFORMS Journal on Computing 29(1):185-195.

Bertsimas D, King A, Mazumder R, et al. (2016) Best subset selection via a modern optimization lens. The Annals of Statistics 44(2):813-852.

Boland N, Charkhgard H, Savelsbergh M (2015a) A criterion space search algorithm for biobjective integer programming: The balanced box method. INFORMS Journal on Computing 27(4):735-754.

Boland N, Charkhgard H, Savelsbergh M (2015b) A criterion space search algorithm for biobjective mixed integer programming: The triangle splitting method. INFORMS Journal on Computing 27(4):597-618.

Boland N, Charkhgard H, Savelsbergh M (2016) The L-shape search method for triobjective integer programming. Mathematical Programming Computation 8(2):217-251.

Boland N, Charkhgard H, Savelsbergh M (2017a) A new method for optimizing a linear function over the efficient set of a multiobjective integer program. European Journal of Operational Research 260(3):904-919.

Boland N, Charkhgard H, Savelsbergh M (2017b) The quadrant shrinking method: A simple and efficient algorithm for solving tri-objective integer programs. European Journal of Operational Research 260(3):873-885.

Bottou L (2010) Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT 2010, 177-186 (Springer).

Breiman L (2001) Random forests. Machine Learning 45(1):5-32.

Charkhgard H, Eshragh A (2019) A new approach to select the best subset of predictors in linear regression modeling: bi-objective mixed integer linear programming. ANZIAM Journal. Available online. https://doi.org/10.1017/s1446181118000275.

Crammer K, Singer Y (2001) On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2(Dec):265-292.

optmzaton solvers. Note that, n ths paper, we studed the problem of learnng to project n a statc settng,.e., before solvng an nstance we predct the best objectve functon and use t durng the course of the search. So, one future research drecton of ths study would be fndng a way to employ the proposed learnng-to-project technque n a dynamc settng,.e., at each teraton n the search process we predct the best projected objectve and use t. Evdently, ths may result n developng new algorthms that have not yet been studed n the lterature of mult-objectve optmzaton. References Abbas M, Chaabane D (2006) Optmzng a lnear functon over an nteger effcent set. European Journal of Operatonal Research 174(2):1140 1161. Alvarez AM, Louveaux Q, Wehenkel L (2017) A machne learnng-based approxmaton of strong branchng. INFORMS Journal on Computng 29(1):185 195. Bertsmas D, Kng A, Mazumder R, et al. (2016) Best subset selecton va a modern optmzaton lens. The annals of statstcs 44(2):813 852. Boland N, Charkhgard H, Savelsbergh M (2015a) A crteron space search algorthm for bobjectve nteger programmng: The balanced box method. INFORMS Journal on Computng 27(4):735 754. Boland N, Charkhgard H, Savelsbergh M (2015b) A crteron space search algorthm for bobjectve mxed nteger programmng: The trangle splttng method. INFORMS Journal on Computng 27(4):597 618. Boland N, Charkhgard H, Savelsbergh M (2016) The L-shape search method for trobjectve nteger programmng. Mathematcal Programmng Computaton 8(2):217 251. Boland N, Charkhgard H, Savelsbergh M (2017a) A new method for optmzng a lnear functon over the effcent set of a multobjectve nteger program. European Journal of Operatonal Research 260(3):904 919. Boland N, Charkhgard H, Savelsbergh M (2017b) The quadrant shrnkng method: A smple and effcent algorthm for solvng tr-objectve nteger programs. European Journal of Operatonal Research 260(3):873 885. Bottou L (2010) Large-scale machne learnng wth stochastc gradent descent. Proceedngs of COMP- STAT 2010, 177 186 (Sprnger). Breman L (2001) Random forests. Machne learnng 45(1):5 32. Charkhgard H, Eshragh A (2019) A new approach to select the best subset of predctors n lnear regresson modelng: b-objectve mxed nteger lnear programmng. ANZIAM journal. Avalable onlne. https: //do.org/10.1017/s1446181118000275. Crammer K, Snger Y (2001) On the algorthmc mplementaton of multclass kernel-based vector machnes. Journal of machne learnng research 2(Dec):265 292. 23