Auto-WEKA: Combined Selection and Hyperparameter Optimization of Supervised Machine Learning Algorithms


Auto-WEKA: Combined Selection and Hyperparameter Optimization of Supervised Machine Learning Algorithms

by

Chris Thornton

B.Sc., University of Calgary, 2011

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the Faculty of Graduate and Postdoctoral Studies (Computer Science)

The University of British Columbia (Vancouver)

March 2014

© Chris Thornton, 2014

Abstract

Many different machine learning algorithms exist; taking into account each algorithm's set of hyperparameters, there is a staggeringly large number of possible choices. This project considers the problem of simultaneously selecting a learning algorithm and setting its hyperparameters. Previous work attacks these issues separately, but this combined problem can be addressed by a fully automated approach, in particular by leveraging recent innovations in Bayesian optimization. The WEKA software package provides implementations of a number of feature selection and supervised machine learning algorithms, which we use inside our automated tool, Auto-WEKA. Specifically, we examined the 3 search and 8 evaluator methods for feature selection, as well as all of the classification and regression methods, spanning 2 ensemble methods, 10 meta-methods, 27 base algorithms, and their associated hyperparameters. On 34 popular datasets from the UCI repository, the Delve repository, the KDD Cup '09, variants of the MNIST dataset, and CIFAR-10, our method produces classification and regression performance often much better than that obtained using state-of-the-art algorithm selection and hyperparameter optimization methods from the literature. Using this integrated approach, users can more effectively identify not only the best machine learning algorithm, but also the corresponding hyperparameter settings and feature selection methods appropriate for that algorithm, and hence achieve improved performance for their specific classification or regression task.

Preface

This thesis is an expanded version of work that has been published as C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2013. I was involved in the conceptual design of Auto-WEKA, and was responsible for the development of Auto-WEKA's code. I performed all the experiments and analysis of results. In the remainder of this thesis, I adopt the first person plural in recognition of my collaborators.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
   1.1 Supervised machine learning problems
   1.2 Learning algorithm selection
       1.2.1 Previous approaches to learning algorithm selection
   1.3 Hyperparameter optimization
       1.3.1 Previous approaches to solving hyperparameter optimization
2 CASH and algorithms for solving it
   2.1 Baselines
   2.2 Model-based methods
       2.2.1 Sequential model-based algorithm configuration (SMAC)
       2.2.2 Tree-structured Parzen estimator (TPE)
       2.2.3 Iterated F-Race (I/F-Race)
3 Auto-WEKA
4 Evaluating Auto-WEKA
   4.1 Experimental setup
   4.2 Classification results
       The importance of solving CASH effectively
       Results for training performance
       Results for test performance
       Selected methods
   4.3 Regression results
       Results for training performance
       Results for test performance
       Selected methods
   4.4 Other modifications of SMAC-based Auto-WEKA
       Immediate evaluation of all folds
       Multi-level cross-validation
       Repeated random subsampling validation (RRSV)
       Longer runtimes
5 Conclusion and future work
Bibliography
A Method Comparison Results

List of Tables

Table 3.1  Learning algorithms in Auto-WEKA. * indicates meta-methods, which in addition to their own parameters take one base algorithm and its parameters. + indicates ensemble methods that take as input up to 5 base algorithms and their parameters. We report the number of categorical and numeric hyperparameters for each method.
Table 3.2  Feature search/evaluator methods in Auto-WEKA. * indicates search methods requiring one feature evaluator that is used to determine the importance of a feature.
Table 4.1  Classification datasets used. Num Categorical and Num Numeric refer to the number of categorical and numeric attributes of elements in the dataset, respectively.
Table 4.2  Oracle performance of Ex-Def and grid search.
Table 4.3  Training performance on classification datasets (Error %). Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.4  Test performance on classification datasets (Error %). Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.5  Correlation between the withheld 30% validation data and the training data performance. Gap indicates the difference between the mean training performance and mean test performance from Tables 4.3 and 4.4.
Table 4.6  Regression datasets used. Num Categorical and Num Numeric refer to the number of categorical and numeric attributes of elements in the dataset, respectively.
Table 4.7  Training performance on regression datasets (RMSE). Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.8  Test performance on regression datasets (RMSE). Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.9  Correlation between the withheld 30% validation data and the training data performance. Gap indicates the difference between the mean training performance and mean test performance from Tables 4.7 and 4.8.
Table 4.10 Comparisons of mean performance obtained between the SMAC and SMAC-10-Batch variants on classification datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.11 Comparisons of mean performance obtained between the SMAC and SMAC-10-Batch variants on regression datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.12 Comparisons of mean performance obtained between the SMAC and SMAC-Multi-Level variants on classification datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.13 Comparisons of mean performance obtained between the SMAC and SMAC-Multi-Level variants on regression datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.14 Comparisons of mean performance obtained between the SMAC and SMAC-RRSV variants on classification datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.15 Comparisons of mean performance obtained between the SMAC and SMAC-RRSV variants on regression datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.16 Comparisons of mean performance obtained between the SMAC and SMAC-Long variants on classification datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.17 Comparisons of mean performance obtained between the SMAC and SMAC-Long variants on regression datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table A.1  Number of statistically significant wins on training performance of each method compared against another on classification datasets.
Table A.2  Number of statistically significant wins on test performance of each method compared against another on classification datasets.
Table A.3  Number of statistically significant wins on training performance of each method compared against another on regression datasets.
Table A.4  Number of statistically significant wins on test performance of each method compared against another on regression datasets.

List of Figures

Figure 3.1  Auto-WEKA's top-level parameters. Top: is_base controls Auto-WEKA's choice between using a base algorithm and using a meta or ensemble learner. The triangular items represent a parameter that selects one of the 27 base algorithms and its associated hyperparameters. Bottom: feat_sel controls Auto-WEKA's choice of feature selection methods.
Figure 3.2  Auto-WEKA's wizard interface.
Figure 3.3  Auto-WEKA's experiment builder workflow.
Figure 3.4  Auto-WEKA's interface for examining the best learning algorithm and hyperparameters after an experiment has been run.
Figure 4.1  Distribution of chosen classifiers aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants across all the small and large datasets, ranked on their frequency of being selected. Meta-methods are marked by a * suffix, ensemble methods by a + suffix.
Figure 4.2  Heat map of chosen classifiers aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants for each dataset. A darker colour indicates the method was selected more often. Meta-methods are marked by a * suffix, ensemble methods by a + suffix. Datasets are sorted by size, classifiers are ordered by methodology.
Figure 4.3  Left: distribution of chosen base classifiers for the two most frequently selected meta-methods: AdaBoostM1 and MultiClass Classifier. Right: distribution of chosen feature search and evaluator methods. Both plots are aggregated across all Auto-WEKA variants; None indicates that no feature selection was performed.
Figure 4.4  Heat map of chosen classifiers in all chosen meta-methods aggregated across the SMAC, I/F-Race, and TPE Auto-WEKA variants for each dataset. A darker colour indicates the method was selected more often. Datasets are sorted by size, classifiers are ordered by methodology.
Figure 4.5  Distribution of chosen regression algorithms aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants across all small and large datasets, ranked on their frequency of being selected. Meta-methods are marked by a * suffix, ensemble methods by a + suffix.
Figure 4.6  Heat map of chosen regression algorithms aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants for each dataset. A darker colour indicates that the method was selected more often. Meta-methods are marked by a * suffix, ensemble methods by a + suffix. Datasets are sorted by size, regression algorithms are ordered by methodology.
Figure 4.7  Left: distribution of chosen base regression algorithms for the two most frequently selected meta-methods: additive regression and bagging. Right: distribution of chosen feature search and evaluator methods. Both plots are aggregated across all Auto-WEKA variants; None indicates that no feature selection was performed.
Figure 4.8  Heat map of chosen regression algorithms in all chosen meta-methods aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants for each dataset. A darker colour indicates that the method was selected more often. Datasets are sorted by size, regression algorithms are ordered by methodology.
Figure 4.9  Graphical representation of the training data partitioning scheme used by SMAC-Multi-Level.
Figure 4.10 Trajectories of training and test performance over time for two small datasets. The vertical black line indicates the original 30 hour time budget. Shaded areas show the 10-90% quantile from the bootstrapped samples.
Figure 4.11 Trajectories of training and test performance over time for two large datasets. The vertical black line indicates the original 30 hour time budget. Shaded areas show the 10-90% quantile from the bootstrapped samples.

Acknowledgements

There are many people who helped make this work happen. First, I would like to thank my supervisors, Holger Hoos and Kevin Leyton-Brown, as well as close collaborator Frank Hutter, for their enduring guidance in working on this project. The members of both the β-lab and the GTDT reading group have been enormously supportive (both directly and indirectly): Alexandre Fréchette, Baharak Rastegari, Chris Fawcett, David Thompson, James Wright, Sam Bayless, Steve Ramage, and Zach Drudi. Finally, thanks to my friends and family for their unceasing encouragement.

Chapter 1

Introduction

An increasing variety of sophisticated feature selection and learning algorithms, complete with many hyperparameters, is available to a growing number of machine learning practitioners. These users require off-the-shelf solutions to their data analysis problems. The machine learning community has greatly aided such users by making available open source packages such as WEKA [Hall et al., 2009] and PyBrain [Schaul et al., 2010]. Such packages require a user to make two kinds of choices: first, to select a learning algorithm, and second, to customize it by setting hyperparameters, which may also control feature selection. It can be daunting to make the best choices when faced with so many degrees of freedom. Often a user may lack an in-depth understanding of the terminology and mechanics associated with each learning algorithm and its hyperparameter settings. This leads many users to select algorithms based on reputation or intuitive appeal, often leaving hyperparameters set to their default values. Adopting such a selection approach can yield poor performance. This suggests a natural challenge for machine learning: given a dataset, automatically and simultaneously choose a learning algorithm and set its hyperparameters to optimize empirical performance. We dub this problem the combined algorithm selection and hyperparameter optimization (CASH) problem. We provide a tool, Auto-WEKA, which requires minimal input from its user and provides a solution to CASH, searching over the learning algorithms provided in the standard WEKA distribution.

The CASH problem consists of two main subproblems: algorithm selection and hyperparameter optimization. The remainder of this chapter defines these subproblems and discusses previous work by the machine learning community to address them individually. In Chapter 2, we formally define the CASH problem, discussing the small amount of attention that variants of the CASH problem have received in the literature, as well as some possible methods for solving it.

Chapter 3 describes the design and mechanics of Auto-WEKA, our solution to an instance of the CASH problem. An in-depth empirical analysis of Auto-WEKA on 21 classification tasks and 13 regression tasks, comparing Auto-WEKA against standard baselines, is provided in Chapter 4. Conclusions and future work are discussed in Chapter 5.

1.1 Supervised machine learning problems

Our work focuses on supervised machine learning problems: learning a function f : X → Y, with X a set of features and Y either a finite set of different labels (for classification) or a subset of R (for regression). A supervised learning algorithm A maps a set {d_1, ..., d_n} of training data points d_i = (x_i, y_i) ∈ X × Y to such a function. The family of functions that A can produce is often called a model, while the output of A is often expressed via a vector of model parameters. The learned function can then be used on new data points x_j that were not contained in the training set, predicting the corresponding values ŷ_j. Most learning algorithms A further expose hyperparameters λ from a hyperparameter space Λ, which change the way the learning algorithm A_λ learns the desired function. Hyperparameters are used to indicate quantities such as a description-length penalty, the kernel width of a support vector machine, the number of neurons in a hidden layer of a neural network, or the number of data points that a leaf in a decision tree must contain to be eligible for splitting. In order to obtain a function that produces accurate predictions, learning algorithm selection and hyperparameter optimization need to be interwoven.

1.2 Learning algorithm selection

Learning algorithm selection, also called model selection, has been well studied by the machine learning community; a sample of this work is discussed in Section 1.2.1. Given a set of learning algorithms A and a set of training data D = {(x_1, y_1), ..., (x_n, y_n)}, the goal of model selection is to determine the algorithm A* ∈ A with the best generalization performance. Generalization performance is estimated by splitting D into (possibly many) disjoint training and validation sets D_train^(i) and D_valid^(i) for i = 1, ..., k, learning functions f_i by applying A* to D_train^(i), and evaluating the predictive performance of these functions on D_valid^(i).

This allows the learning algorithm selection problem to be written as:

\[
A^{*} \in \operatorname*{argmin}_{A \in \mathcal{A}} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\bigl(A,\ D_{\mathrm{train}}^{(i)},\ D_{\mathrm{valid}}^{(i)}\bigr), \tag{1.1}
\]

where L(A, D_train^(i), D_valid^(i)) is the loss achieved by A when trained on D_train^(i) and evaluated on D_valid^(i). For classification problems, the loss is typically defined as the rate at which the predictions differ from the labels of the validation data, whereas for regression problems the loss is often expressed as the root mean squared error (RMSE).

1.2.1 Previous approaches to learning algorithm selection

The simplest (and most general) approach to model selection is to first fix a set A of many different learning algorithms, compute an estimate of the loss function for each algorithm using the partitioned training data, and finally select the algorithm with the lowest estimated loss. One of the most common techniques for splitting the training data into pairs of training and validation sets is k-fold cross-validation, which splits the training data into k equal-sized partitions D_valid^(1), ..., D_valid^(k) and sets D_train^(i) = D \ D_valid^(i) for i = 1, ..., k. This is not the only way to partition the training data; Kohavi [1995] presents other techniques, such as repeated random subsampling validation.
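To make Equation (1.1) concrete, here is a minimal Python sketch of this exhaustive selection procedure (not part of Auto-WEKA; the candidate algorithms and the per-split `loss` function are placeholders supplied by the caller):

```python
import random

def k_fold_splits(data, k, seed=0):
    """Split data into k (train, valid) pairs: D_train^(i) = D \\ D_valid^(i)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [
        ([d for j, fold in enumerate(folds) if j != i for d in fold], folds[i])
        for i in range(k)
    ]

def select_algorithm(algorithms, data, loss, k=10):
    """Return the algorithm minimizing the mean k-fold cross-validation loss (Eq. 1.1)."""
    splits = k_fold_splits(data, k)

    def cv_loss(algo):
        return sum(loss(algo, train, valid) for train, valid in splits) / k

    return min(algorithms, key=cv_loss)
```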

This exhaustive approach suffers from the high computational cost of estimating the loss function of each algorithm, and from the more philosophical hurdle of deciding which algorithms should be included in the set A. Hoeffding races [Maron and Moore, 1994] address the first of these issues: the cost of selecting amongst a number of different algorithms. The main idea in racing algorithms for model selection is to determine which candidates (the models being compared) are highly probable to be inferior to the best candidates. Once inferior candidates have been identified, there is no need to expend further effort investigating their performance. In a Hoeffding race, a schedule over the pairs of training and validation data sets is chosen uniformly at random, determining the order in which the pairs will be used for estimating the loss of each candidate. The race consists of many rounds; at each round, the next pair of training and validation data is taken from the schedule and used to update the estimate of the loss function for each candidate. Hoeffding's bound [Hoeffding, 1963] is then used to produce an upper and a lower bound on the true value of the loss function for each candidate algorithm. Any candidate whose lower (i.e., best-case) bound lies above the best candidate's upper (i.e., worst-case) bound is eliminated. The race continues until only one candidate remains or all the pairs of training and validation data have been used to estimate the loss. Note that the race requires an initial burn-in period to gain a reliable estimate of the loss function before removing any candidates from the race. This implies that no candidate will be eliminated for some number of rounds at the beginning of the race. Additionally, because the data is used for multiple comparisons, techniques such as Bonferroni correction need to be used to avoid statistical errors [Maron and Moore, 1994].

Meta-learning is a discipline that uses machine learning to make predictions about a dataset as a whole, rather than about a particular element of the dataset [Bardenet et al., 2013, Leite et al., 2012, Pfahringer et al., 2000, Vilalta and Drissi, 2002]. One such meta-learning technique is landmarking. For each dataset in a repository of many datasets, a vector of dataset features is computed, such as the number of categorical or numeric attributes, the number of prediction labels (only for classification), or the size of the dataset. Additionally, the loss function of a number of different learning algorithms is evaluated on each dataset in the repository. A meta-learner is then trained on these pairs of dataset features and model performance, either predicting the best algorithm for a particular dataset or providing a ranking over algorithms that should be used on the dataset. Using the formalization of Section 1.1, the meta-learner operates on a dataset in which the x_i contain the features of datasets used in supervised machine learning tasks, and the corresponding y_i indicate the learning algorithm with the best performance. Landmarking suffers from the fact that even with an extensive repository of dataset features and method performances (which requires significant computational investment), it is likely that there will be subsequent machine learning problems proposed by the user for which the meta-learner makes inaccurate predictions. Such is the pitfall of exploratory research in any discipline. Note also that determining which learning algorithm to use for the meta-learner is itself another instance of model selection, so the algorithm chosen for the meta-learner can heavily influence which methods are selected.

Another consideration when performing model selection is the choice of loss function. There may be extra information inside the learning algorithm that provides a better indication of its generalization performance on new data. One such measure is Akaike's entropic information criterion [Bozdogan, 1987], known as AIC. AIC represents a compromise between the complexity of the learned function and the loss estimate, based on the idea that less complex functions are more likely to generalize to new data, consistent with the principle known as Occam's razor. Similar techniques, such as the Bayesian information criterion [Schwarz, 1978], provide alternative ways of balancing loss and model complexity.

1.3 Hyperparameter optimization

The problem of optimizing the hyperparameters λ ∈ Λ of a given learning algorithm A is conceptually similar to that of model selection. In both cases, the best-performing predictive model for a given dataset is desired, but instead of selecting from many different learning algorithms, the optimization considers a single algorithm's hyperparameters. The hyperparameters of a learning algorithm are often continuous, and their hyperparameter spaces are often high-dimensional. Additionally, it is possible to exploit the correlation between different hyperparameter settings λ_1, λ_2 ∈ Λ, a characteristic with no natural analogue in model selection. Given n hyperparameters λ_1, ..., λ_n with domains Λ_1, ..., Λ_n, the hyperparameter space Λ is a subset of the cross-product of these domains: Λ ⊆ Λ_1 × ... × Λ_n. This subset is often strict, such as when certain settings of one hyperparameter render other hyperparameters inactive. For example, the parameters determining the specifics of the third layer of a deep belief network are not relevant if the network depth is set to one or two. Likewise, the parameters of a support vector machine's polynomial kernel are not relevant if a radial basis function kernel is used instead.

More formally, following Hutter et al. [2009], we say that a hyperparameter λ_i is conditional on another hyperparameter λ_j if λ_i is only active when λ_j takes values from a given set V_i(j) ⊆ Λ_j; in this case, we call λ_j a parent of λ_i (and conversely, λ_i a child of λ_j). Conditional hyperparameters can in turn be parents of other conditional hyperparameters, giving rise to a tree-structured space [Bergstra et al., 2011] or, in some cases, a directed acyclic graph (DAG) [Hutter et al., 2009]. Given such a structured space Λ, the (hierarchical) hyperparameter optimization problem can be formalized as identifying

\[
\lambda^{*} \in \operatorname*{argmin}_{\lambda \in \Lambda} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\bigl(A_{\lambda},\ D_{\mathrm{train}}^{(i)},\ D_{\mathrm{valid}}^{(i)}\bigr).
\]

1.3.1 Previous approaches to solving hyperparameter optimization

Manual tuning of hyperparameter values has often been used in the past, since experienced users may have good intuition about which hyperparameters are likely to influence the performance of their learning algorithm most. By iteratively trying new hyperparameter settings, a user can home in on those that perform well.

However, this can be a time-consuming process and can nevertheless often result in suboptimal performance. The weaknesses of manual tuning are particularly apparent when the user's intuition is not valid for their specific problem.

Rather than relying on a user to guide the choice of hyperparameter values, grid search [Friedman et al., 2009] is one of the simplest automatic alternatives. Grid search requires that each hyperparameter λ_i in the hyperparameter space be treated discretely. Each numeric hyperparameter is discretized between some minimal and maximal value, while categorical hyperparameters remain unchanged. The set of grid points is then defined to be the Cartesian product of each of the now-discrete λ_i. At each of these grid points, the loss function is computed for all of the pairs (folds) of training and validation data, and the hyperparameter setting with the best performance over this grid is then used. Due to the combinatorial nature of grid search, this can be quite a computational burden if the discretization is fine or (particularly) if there are many hyperparameters. This can be partially addressed by starting with a very coarse discretization and then refining the upper and lower bounds of the hyperparameters to explore the area around the grid point with the best performance in the previous iteration [Van Gestel et al., 2004].

Grid search also suffers from the fact that often only a few hyperparameters are responsible for most of a learning algorithm's performance. To prevent a combinatorial explosion of grid points, each hyperparameter is discretized into a relatively small number of values. While the total number of different hyperparameter combinations examined over the course of a grid search is often quite high, each individual hyperparameter only has a few possible values tested. This is particularly problematic because the few hyperparameters responsible for a large portion of the performance variation receive the same amount of attention as the hyperparameters that barely affect performance. By instead sampling values for all hyperparameters at random, important hyperparameters take on many different values, resulting in a more effective search of the hyperparameter space. Using this random search, Bergstra and Bengio [2012] showed that with fewer resources, the performance of the selected hyperparameter values was better than that of both grid search and expert manual tuning. Like grid search, random search is also trivially parallelizable: by performing independent runs of the search with different random seeds on all available machines, it is easy to take advantage of large compute clusters or cloud computing to simultaneously examine many different hyperparameter values.
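The contrast between the two strategies is easiest to see in code. The sketch below uses a hypothetical search space for a single SVM-like learner (the ranges and `cv_loss` are placeholders, not part of the thesis): it draws configurations at random and keeps the best one.

```python
import math
import random

# Hypothetical search space: numeric ranges are sampled log-uniformly,
# categorical choices uniformly.
SPACE = {
    "C":         ("log-uniform", 1e-3, 1e3),
    "gamma":     ("log-uniform", 1e-4, 1e1),
    "shrinking": ("choice", [True, False]),
}

def sample_configuration(space, rng):
    config = {}
    for name, spec in space.items():
        if spec[0] == "log-uniform":
            lo, hi = spec[1], spec[2]
            config[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        else:  # categorical choice
            config[name] = rng.choice(spec[1])
    return config

def random_search(cv_loss, space, budget=100, seed=0):
    """Evaluate `budget` random configurations; return the best (loss, config)."""
    rng = random.Random(seed)
    best = (float("inf"), None)
    for _ in range(budget):
        config = sample_configuration(space, rng)
        best = min(best, (cv_loss(config), config), key=lambda t: t[0])
    return best
```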

Evolutionary techniques have also been successfully applied to hyperparameter optimization, such as in the work of Guo et al. [2008], where a particle swarm optimizer tuned the hyperparameters of a support vector machine. In the work of Jin and Sendhoff [2008], evolutionary algorithms for multiobjective optimization were applied to set the hyperparameters and the complexity of the learned model. These techniques are promising, since they make few assumptions about the underlying optimization problem and are able to handle scenarios with many parameters, such as the work of Guo et al. [2008], which optimized 15 hyperparameters.

If all hyperparameters are numeric and the performance of the learning algorithm is well-behaved with respect to the hyperparameters, gradient-based techniques can be used [Bengio, 2000]. The gradient information can be computed directly or approximated empirically. One of the most popular of these techniques is stochastic gradient descent (SGD; Bottou [1998]). SGD is especially appealing for cases with large amounts of data, since partial gradient information can be computed using mini-batches of the data, making it possible to optimize performance for datasets that cannot be loaded into memory. Like all gradient-based techniques, if the learning algorithm's loss function is convex in the hyperparameters, SGD will not become trapped in a local minimum and thus yields optimal hyperparameter settings.

Recently, techniques from Bayesian optimization have been used to search over hyperparameters: Snoek et al. [2012] used Gaussian processes and Bergstra et al. [2011] used a tree of Parzen estimators to find good hyperparameter settings. These methods have been shown to perform better than either grid or random search; in particular, Bergstra et al. [2011] were able to find hyperparameter settings for a deep belief network that surpassed the state of the art on a variant of the MNIST character recognition dataset.

There also exist various techniques that optimize hyperparameters for a specific family of learning algorithms. For example, Strijov and Weber [2010] used coherent Bayesian inference to adjust the coefficients in their parametric regression procedure. The drawback of such targeted optimization approaches is that they rely heavily on the specifics of the algorithm they are optimizing, making them difficult to transfer to other learning algorithms.

Chapter 2

CASH and algorithms for solving it

The combined algorithm selection and hyperparameter optimization (CASH) problem formally defines the challenge of simultaneously selecting a machine learning algorithm and choosing the associated hyperparameter values for that algorithm. Solutions to this problem have large practical importance to the machine learning community, as users seek to leverage state-of-the-art algorithms for their research. Given a set of algorithms A = {A^(1), ..., A^(k)} with associated hyperparameter spaces Λ^(1), ..., Λ^(k), and disjoint pairs of training and validation data D_train^(i) and D_valid^(i), the goal in solving the CASH problem is to find:

\[
A^{*}_{\lambda^{*}} \in \operatorname*{argmin}_{A^{(j)} \in \mathcal{A},\ \lambda \in \Lambda^{(j)}} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\bigl(A^{(j)}_{\lambda},\ D_{\mathrm{train}}^{(i)},\ D_{\mathrm{valid}}^{(i)}\bigr). \tag{2.1}
\]

We note that this problem can be reformulated as a single combined hierarchical hyperparameter optimization problem with parameter space Λ = Λ^(1) ∪ ... ∪ Λ^(k) ∪ {λ_r}, where λ_r ∈ {A^(1), ..., A^(k)} is a new root-level hyperparameter that selects between the algorithms A^(1), ..., A^(k). The root-level parameters of each subspace Λ^(j) are made conditional on λ_r being instantiated to A^(j).
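The reformulation can be pictured as a sampler over the combined space, in which the root-level choice λ_r activates exactly one algorithm's subspace. The following is a toy two-algorithm sketch, not Auto-WEKA's actual encoding:

```python
import math
import random

# lambda_r is the root-level hyperparameter choosing the algorithm; each
# algorithm's subspace is only sampled (active) when lambda_r selects it.
CASH_SPACE = {
    "lambda_r": ["svm", "random_forest"],
    "svm":           {"C": ("log-uniform", 1e-3, 1e3),
                      "gamma": ("log-uniform", 1e-4, 1e1)},
    "random_forest": {"num_trees": ("int-uniform", 10, 500),
                      "max_depth": ("int-uniform", 2, 20)},
}

def sample_cash_configuration(space, rng):
    algo = rng.choice(space["lambda_r"])               # instantiate lambda_r
    config = {"lambda_r": algo}
    for name, (kind, lo, hi) in space[algo].items():   # only the active subspace
        if kind == "log-uniform":
            config[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        else:  # "int-uniform"
            config[name] = rng.randint(lo, hi)
    return config
```

Randomly sampling such configurations and keeping the one with the lowest cross-validation loss is precisely the random-search baseline described in Section 2.1.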

Given the extensive literature on model selection and hyperparameter optimization, and in light of the problem's practical importance, we were surprised to find that only limited variants of the CASH problem have been studied. Furthermore, each of these variants is applicable only to a fixed and relatively small number of parameter configurations for each algorithm. For example, in the meta-learning-based work of Leite et al. [2012], a total of 292 algorithm-hyperparameter combinations were considered, spanning six different learning algorithms, while Sun and Pfahringer [2013] present another meta-learning approach that considers twenty learning algorithms over 466 datasets. Admittedly, it is very challenging to search the combined space of learning algorithms and their hyperparameters: the space is high-dimensional, involving both categorical and continuous choices, and the response function is noisy due to the limited quantity of validation data. Furthermore, the search space contains hierarchical dependencies; for example, the hyperparameters of a learning algorithm are only meaningful if that algorithm is chosen, and the base algorithm choices in an ensemble method are only meaningful if that particular ensemble method is chosen.

The remainder of this chapter describes a number of possible procedures for solving CASH, adapting existing selection and optimization strategies from the literature. The first three methods, described in Section 2.1, are either simple approaches or already in wide use by the machine learning community, while the last three methods, detailed in Section 2.2, all employ more complex optimization strategies.

2.1 Baselines

In principle, a solution to the CASH problem may be identified in a variety of ways. Our Exhaustive-Default (Ex-Def) technique was implemented as a rudimentary approach requiring minimal computational resources. To use Ex-Def, the user obtains implementations of a number of different learning algorithms that are applicable to their specific learning task and dataset. Ex-Def then computes the standard k-fold cross-validation error for each learning algorithm, leaving hyperparameters at the default values set by the implementers of each learning algorithm. After these computations are complete, Ex-Def selects the learning algorithm with the best performance to be used on the dataset. Note that this simple selection technique is unlikely to produce optimal performance, since it does not tune hyperparameters beyond the defaults for the particulars of the given dataset.

Users with more computational resources at their disposal may employ a grid search technique, where the grid is the union of the distinct sub-grids for each of the available learning algorithms. While grid search can require an extensive CPU time budget for optimizing the hyperparameters of even a single learning algorithm, this cost only increases linearly with the number of learning algorithms considered. Setting up such a grid search can also be labour-intensive, even using readily available research tools, such as those found in the open source machine learning package WEKA.
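A sketch of the union-of-sub-grids construction (hypothetical discretizations; `cv_loss` is again a placeholder): each algorithm contributes the Cartesian product of its own discretized hyperparameters, and the best point across all sub-grids is returned.

```python
from itertools import product

# Hypothetical discretized sub-grids, one per learning algorithm.
SUB_GRIDS = {
    "svm":           {"C": [0.01, 1.0, 100.0], "gamma": [0.001, 0.1, 10.0]},
    "random_forest": {"num_trees": [10, 100, 500], "max_depth": [5, 10, 20]},
}

def grid_points(sub_grids):
    """Yield (algorithm, config) pairs from the union of per-algorithm sub-grids."""
    for algo, grid in sub_grids.items():
        names = sorted(grid)
        for values in product(*(grid[n] for n in names)):
            yield algo, dict(zip(names, values))

def grid_search(cv_loss, sub_grids):
    """Return (loss, algorithm, config) of the best grid point across all sub-grids."""
    return min(((cv_loss(algo, cfg), algo, cfg) for algo, cfg in grid_points(sub_grids)),
               key=lambda t: t[0])
```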

WEKA provides two implementations of grid search for tuning the hyperparameters of a single learning algorithm; the first can optimize any number of top-level hyperparameters, while the second can optimize any two hyperparameters, including nested ones. However, the user has to define the minimal and maximal values for each numeric hyperparameter. In order to perform a grid search that solves CASH, the user would have to prepare a number of different grid search experiments using these tools and then select amongst the best models from each of the smaller grid searches.

Random search alleviates some of the drawbacks of grid search and may be applied to CASH in a straightforward way. Samples for the random search are created by simply selecting a learning algorithm at random, then randomly sampling values for each of the hyperparameters (and children of the active hyperparameters) associated with the chosen algorithm. As described in Section 1.3.1, random search offers several advantages over grid search.

2.2 Model-based methods

A promising approach to solving CASH is model-based optimization [Zlochin et al., 2004]. This approach builds a predictive model of the underlying optimization problem and uses this model to guide the optimization process. In particular, the Bayesian approach of Sequential Model-Based Optimization (SMBO) [Hutter et al., 2011], a versatile stochastic optimization framework that can work explicitly with both categorical and continuous hyperparameters, has the ability to exploit the hierarchical structure stemming from the conditional parameters that are prevalent in CASH. As outlined in Algorithm 1, SMBO first builds a model M_L that captures the dependence of the loss function L on hyperparameter settings λ (line 1 of Algorithm 1). It then iterates the following steps: use M_L to determine a promising candidate configuration of hyperparameters λ to evaluate next (line 3), evaluate the loss c of λ (line 4), and update the model M_L with the new data point (λ, c) obtained (lines 5-6).

In order to select the next hyperparameter configuration λ using model M_L, SMBO uses a so-called acquisition function a_{M_L} : Λ → R, which uses the predictive distribution of model M_L at arbitrary hyperparameter configurations λ ∈ Λ to quantify (in closed form) how useful knowledge about λ would be. SMBO then simply maximizes this function over Λ to select the most promising configuration λ to evaluate next. Several well-studied acquisition functions exist [Jones et al., 1998, Schonlau et al., 1998, Srinivas et al., 2010]; all aim to automatically trade off exploitation (locally optimizing hyperparameters in regions known to contain good settings) against exploration (trying hyperparameter settings in relatively unexplored regions). In this work, we maximized the positive expected improvement (EI) attainable over an existing loss value c_min [Schonlau et al., 1998]; the EI is high for hyperparameter configurations with high uncertainty and good predicted performance under the model. Let c(λ) denote the loss achieved by hyperparameter configuration λ. Then the positive improvement function over c_min is defined as I_{c_min}(λ) := max{c_min − c(λ), 0}. Of course, we do not know c(λ). We can, however, compute its expectation with respect to the current model M_L:

\[
\mathbb{E}_{\mathcal{M}_L}\bigl[I_{c_{\min}}(\lambda)\bigr] \;=\; \int_{-\infty}^{c_{\min}} \max\{c_{\min} - c,\ 0\}\; p_{\mathcal{M}_L}(c \mid \lambda)\, \mathrm{d}c. \tag{2.2}
\]

Algorithm 1: SMBO
  Input: algorithm A with hyperparameter space Λ; k pairs of D_train^(i), D_valid^(i); time budget for optimization
  Output: λ ∈ Λ with best performance
  1: initialise model M_L; H ← ∅
  2: while the time budget for optimization has not been exhausted do
  3:     [λ, i] ← candidate configuration and dataset-pair index from M_L
  4:     compute c = L(A_λ, D_train^(i), D_valid^(i))
  5:     H ← H ∪ {(λ, c, i)}
  6:     update M_L based on H
  7: end while
  8: return the λ from H with minimal c

While SMBO algorithms are well suited to solving CASH, other model-based techniques are also applicable. We now review two SMBO algorithms and one more general model-based optimization algorithm that are capable of handling the hierarchical hyperparameters prevalent in CASH. The first algorithm has predominantly been used for algorithm configuration, while the last two have been used before to perform hyperparameter optimization. To our knowledge, these algorithms have not previously been used to consider many different learning algorithms simultaneously.
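For readers who prefer code to pseudocode, the following is a minimal sketch of the SMBO loop of Algorithm 1. The surrogate model, the configuration sampler, and the loss evaluation are placeholders, and candidates are proposed by maximizing the acquisition function over a random pool rather than by SMAC's local search:

```python
import random

def smbo(loss, sample_configuration, surrogate, time_left, expected_improvement,
         n_candidates=1000, seed=0):
    """Generic SMBO loop (cf. Algorithm 1): propose by acquisition, evaluate, refit."""
    rng = random.Random(seed)
    history = []                                   # H = {(lambda, c)}
    incumbent, c_min = None, float("inf")
    while time_left():
        if not history:
            candidate = sample_configuration(rng)  # no model yet: sample at random
        else:
            surrogate.fit(history)                 # M_L built from all observations
            pool = [sample_configuration(rng) for _ in range(n_candidates)]
            candidate = max(pool,
                            key=lambda lam: expected_improvement(surrogate, lam, c_min))
        c = loss(candidate)                        # one cross-validation evaluation
        history.append((candidate, c))
        if c < c_min:
            incumbent, c_min = candidate, c
    return incumbent
```

Here `expected_improvement` would use the surrogate's predictive mean and variance, as shown for SMAC's Gaussian case in the next subsection.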

2.2.1 Sequential model-based algorithm configuration (SMAC)

Sequential model-based algorithm configuration [SMAC; Hutter et al., 2011] has predominantly been used for the task of algorithm configuration: determining the parameters of solvers for (often hard) computational problems in order to produce either higher-quality solutions or faster run times for tasks such as Boolean satisfiability and mixed integer programming. CASH is conceptually similar to algorithm configuration, since parameter settings for industry-standard solvers are often a mix of categorical and numeric parameters, and may include conditional parameters.

SMAC supports a variety of models p(c | λ) to capture the dependence of the loss function c on hyperparameters λ, including approximate Gaussian processes and random forests. In this thesis we used random forest models, since they tend to perform well with discrete and high-dimensional input data. SMAC handles conditional parameters by instantiating inactive conditional parameters in λ to default values for model training and prediction. This allows individual decision trees to include splits of the kind "is hyperparameter λ_i active?", allowing them to focus on active hyperparameters. SMAC obtains a predictive mean µ_λ and variance σ²_λ of p(c | λ) as frequentist estimates over the predictions of its individual trees for λ; it then models p_{M_L}(c | λ) as a Gaussian N(µ_λ, σ²_λ). SMAC uses the expected improvement criterion defined in Equation 2.2, instantiating c_min to the error rate of the best hyperparameter configuration measured so far. Under SMAC's predictive distribution p_{M_L}(c | λ) = N(µ_λ, σ²_λ), this expectation can be expressed in closed form as

\[
\mathbb{E}_{\mathcal{M}_L}\bigl[I_{c_{\min}}(\lambda)\bigr] \;=\; \sigma_{\lambda}\,\bigl[u\,\Phi(u) + \varphi(u)\bigr], \qquad \text{where } u = \frac{c_{\min} - \mu_{\lambda}}{\sigma_{\lambda}},
\]

and ϕ and Φ denote the probability density function and cumulative distribution function of a standard normal distribution, respectively [Jones et al., 1998].

A multi-start local search procedure is used to select the next hyperparameter configurations to evaluate, using the ten hyperparameter configurations already considered by SMAC with the largest EI as starting points. The local search greedily considers a set of neighbouring hyperparameter settings, where neighbours differ in one hyperparameter value, and terminates when there are no neighbours with a higher EI. Additional randomly sampled hyperparameter configurations are also considered among the possible configurations to evaluate next. The EI of this combined set of hyperparameter configurations is then computed from the predictive model, and the configuration with the largest EI is selected. Note that this local search process is computationally cheap, since it only queries the predictive model, and it can be further optimized since many of the predictions are for points that lie relatively near each other in the hyperparameter space.
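The closed-form EI is simple to compute once a surrogate supplies a predictive mean and variance. A minimal sketch follows; the per-tree aggregation mirrors SMAC's frequentist estimates, but this is otherwise not SMAC's implementation:

```python
import math

def standard_normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def standard_normal_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def expected_improvement(mu, sigma, c_min):
    """E[I_{c_min}] = sigma * (u * Phi(u) + phi(u)), with u = (c_min - mu) / sigma."""
    if sigma <= 0.0:
        return max(c_min - mu, 0.0)
    u = (c_min - mu) / sigma
    return sigma * (u * standard_normal_cdf(u) + standard_normal_pdf(u))

def forest_mean_and_std(per_tree_predictions):
    """Frequentist mean/std of the predicted loss over individual tree predictions."""
    n = len(per_tree_predictions)
    mu = sum(per_tree_predictions) / n
    var = sum((p - mu) ** 2 for p in per_tree_predictions) / n
    return mu, math.sqrt(var)
```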

SMAC was designed for robust optimization under noisy function evaluations, and as such implements special mechanisms to keep track of its best known configuration and to assure high confidence in its estimate of that configuration's performance. This robustness against noisy function evaluations can be leveraged in combined algorithm selection and hyperparameter optimization, since the function to be optimized in Equation (1.1) is a mean over a set of loss terms (each corresponding to one pair D_train^(i) and D_valid^(i) constructed from the training set). A key idea in SMAC is to make progressively better estimates of this mean by evaluating the loss terms one at a time, thereby trading off accuracy against computational cost. In order for a new configuration to become the new incumbent (the current best configuration found so far), it must outperform the previous incumbent in every comparison made: considering only one fold, two folds, and so on, up to the total number of folds previously used to evaluate the incumbent. Furthermore, every time the incumbent survives such a comparison, it is evaluated on a new fold, up to the total number available, meaning that the number of folds used to evaluate the incumbent grows over time. This also allows a poorly performing configuration to be removed from consideration after evaluating it on a single fold.

Finally, SMAC implements a diversification mechanism to achieve robust performance even when its model is misled, and to explore new parts of the space: every other configuration is selected uniformly at random. These randomly selected points improve the accuracy of the model and do not significantly hamper SMAC's progress if it has already found a high-quality region of the search space; because of the evaluation procedure just described, this requires less overhead than one might imagine.

2.2.2 Tree-structured Parzen estimator (TPE)

The Tree-structured Parzen Estimator [TPE; Bergstra et al., 2011] is an optimization technique specifically designed for hyperparameter optimization. While SMAC models p(c | λ) explicitly, TPE uses separate models for p(c) and p(λ | c). Specifically, it models p(λ | c) as one of two density estimates, conditional on whether c is greater or less than a given threshold value c*:

\[
p(\lambda \mid c) \;=\;
\begin{cases}
\ell(\lambda), & \text{if } c < c^{*}, \\
g(\lambda), & \text{if } c \geq c^{*}.
\end{cases}
\]

Here, c* is chosen as the γ-quantile of the losses TPE has obtained so far (where γ is an algorithm parameter with a default value of γ = 0.15), ℓ(·) is a density estimate learned from all previous hyperparameter settings λ with corresponding loss smaller than c*, and g(·) is a density estimate learned from all previous hyperparameter settings λ with corresponding loss greater than or equal to c*. Intuitively, this creates a probabilistic density estimator ℓ(·) for hyperparameter settings that appear to do well, and a different density estimator g(·) for hyperparameter settings that appear to do poorly with respect to the threshold. Bergstra et al. [2011] showed that the expected improvement E_{M_L}[I_{c_min}(λ)] from Equation 2.2 is proportional to

\[
\Bigl(\gamma + \frac{g(\lambda)}{\ell(\lambda)}\,(1-\gamma)\Bigr)^{-1}.
\]

TPE maximizes this expression by generating many candidate hyperparameter configurations at random from ℓ(·) and picking a λ that minimizes g(λ)/ℓ(λ).

The density estimators ℓ(·) and g(·) have a hierarchical structure with continuous, discrete, and conditional variables reflecting the hyperparameters and their dependence relationships. For each node in this tree structure, a one-dimensional Parzen estimator is created to model the probability density of the node's corresponding hyperparameter. For a given hyperparameter configuration λ that is added to either ℓ or g, only the one-dimensional estimators corresponding to active hyperparameters in λ are updated. For continuous hyperparameters, these estimators are constructed by placing density in the form of a Gaussian at each observed hyperparameter value λ_i, with standard deviation set to the greater of the distances to the value's left and right neighbours. Discrete hyperparameters are estimated with probabilities proportional to the number of times that each particular choice occurred in the set of observations. To evaluate a candidate configuration λ's probability estimate, TPE starts at the root of the tree and descends into the leaves by following paths that only use active hyperparameters. At each node in this traversal, the probability of the corresponding hyperparameter value is computed according to its one-dimensional estimator, and the individual probabilities are combined on a pass back up to the root of the tree. Note that this means TPE assumes independence between hyperparameters that do not appear together along any path from the tree's root to one of its leaves. This assumption can be problematic, since it does not account for cases in which interactions between sibling hyperparameters are responsible for performance differences.
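The selection rule can be sketched for a single continuous hyperparameter as follows. This is a toy illustration rather than the implementation of Bergstra et al. [2011]: it uses the neighbour-based bandwidth rule described above and approximates sampling from ℓ(·) by perturbing the good observations.

```python
import math
import random

def parzen_pdf(x, observations):
    """1D Parzen estimator: a Gaussian at each observation, with bandwidth equal to
    the larger distance to its left/right neighbour (fixed width for a single point)."""
    obs = sorted(observations)
    total = 0.0
    for i, mu in enumerate(obs):
        left = obs[i] - obs[i - 1] if i > 0 else None
        right = obs[i + 1] - obs[i] if i < len(obs) - 1 else None
        sigma = max(d for d in (left, right) if d is not None) if len(obs) > 1 else 1.0
        sigma = max(sigma, 1e-12)
        total += math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return total / len(obs)

def tpe_propose(history, gamma=0.15, n_candidates=100, seed=0):
    """history: list of (lambda_value, loss). The best gamma fraction forms the 'good'
    estimator l, the rest the 'bad' estimator g; candidates are scored by g/l."""
    rng = random.Random(seed)
    ranked = sorted(history, key=lambda t: t[1])
    n_good = max(1, int(gamma * len(ranked)))
    good = [lam for lam, _ in ranked[:n_good]]
    bad = [lam for lam, _ in ranked[n_good:]] or good
    candidates = [rng.choice(good) + rng.gauss(0.0, 0.1) for _ in range(n_candidates)]
    return min(candidates,
               key=lambda x: parzen_pdf(x, bad) / max(parzen_pdf(x, good), 1e-12))
```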

2.2.3 Iterated F-Race (I/F-Race)

Iterated F-Race [I/F-Race; Balaprakash et al., 2007] belongs to the more general family of model-based optimization algorithms and, as the name suggests, uses a racing procedure at its core. Like SMAC, I/F-Race has primarily been used for algorithm configuration tasks, such as configuring a solver for scheduling problems [Dubois-Lacoste et al., 2011]. Candidates for the race are sampled randomly, and conditional hyperparameters are supported by sampling child hyperparameters only when their parent hyperparameter is active. I/F-Race can be used to solve CASH by treating the choice of learning algorithm as a root-level hyperparameter.

Recall that Hoeffding races use Hoeffding's bound to assess the likely performance of a racing candidate, and that this bound can often be quite loose. F-Race [Birattari et al., 2002] replaces the bound with the non-parametric Friedman test [Conover, 1998] to find inferior candidates. This test considers the ranks of all the candidates on each pair of training and validation data used so far in the race, and indicates whether some candidates tend to yield better performance than at least one other. As soon as the Friedman test detects the presence of such a difference, pairwise test statistics are computed between the candidates to eliminate those with poor performance. Unlike Hoeffding races, F-Race does not use any form of multiple-testing correction when comparing candidates. Note that F-Race is unable to select different learning algorithms or new values for hyperparameters once the race has begun, so the initial number of racing candidates must be quite large in order to ensure high performance. The initial candidates can be generated, for example, either from all the points of a grid search or through random sampling. Since racing algorithms require a few iterations before they can begin to eliminate candidates, a large portion of the computational resources will still be spent investigating algorithms and hyperparameter settings that are not even close to optimal.

I/F-Race addresses this problem by performing many rounds of a modified F-Race procedure on a more manageable number of candidates, each time randomly sampling new candidates from the space of learning algorithms and hyperparameters. The modifications to the standard F-Race procedure concern the termination conditions: a race is terminated if the number of surviving candidates drops below a fixed threshold, if the race has used at least some number of folds of the dataset, or if some computational budget has been exhausted. These thresholds are all set adaptively based on the specifics of the problem I/F-Race is optimizing. As soon as a (fixed) small number of candidates remain, the round is terminated, and the sampling distributions are updated to be more concentrated around the algorithms and hyperparameter values that appear to provide good performance.

More specifically, in the first round of I/F-Race, all the algorithms and their hyperparameters are sampled uniformly at random. Once a round of F-Race terminates, the surviving candidates are ranked by their performance. To generate new candidates for the next round of the race, I/F-Race first samples from the survivors of the previous round inversely proportionally to their rank (candidates with high performance are more likely to be sampled). A new candidate λ'_s = (λ'_1, ..., λ'_d) is then generated from the sampled survivor λ_s = (λ_1, ..., λ_d) by setting λ'_i ~ N(λ_i, σ'_i), where:

\[
\sigma'_i \;=\; \sigma_i \left(\frac{1}{N_{\max}}\right)^{1/d}.
\]

In this equation, N_max is the initial number of candidates used at the beginning of an iteration of I/F-Race. This approach was designed to reduce the volume of the sampled hyperparameter space at a constant rate each iteration, so that candidates generated in subsequent iterations are concentrated around hyperparameter values that were successful in previous iterations. When I/F-Race finishes its final round of racing, it is possible that several candidates remain without sufficient evidence to indicate which is best. In this case, I/F-Race selects the candidate with the best performance measured over the pairs (folds) of training and validation data used. Like TPE, I/F-Race also assumes independence between hyperparameters (it therefore cannot capture interactions between sibling hyperparameters in its model), and it only samples child hyperparameters when their parents are active.
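A sketch of this candidate-generation step (illustrative only; it assumes purely numeric configurations, whereas the real I/F-Race also handles categorical and conditional hyperparameters):

```python
import random

def sample_survivor(survivors, rng):
    """Pick a ranked survivor with probability inversely proportional to its rank."""
    weights = [1.0 / rank for rank in range(1, len(survivors) + 1)]
    return rng.choices(survivors, weights=weights, k=1)[0]

def generate_candidates(survivors, sigmas, n_max, n_new, seed=0):
    """survivors: numeric configurations ranked best-first; sigmas: per-dimension
    standard deviations. New candidates are Gaussian perturbations with shrunk sigma."""
    rng = random.Random(seed)
    d = len(sigmas)
    shrink = (1.0 / n_max) ** (1.0 / d)          # sigma'_i = sigma_i * (1/N_max)^(1/d)
    new_sigmas = [s * shrink for s in sigmas]
    candidates = []
    for _ in range(n_new):
        parent = sample_survivor(survivors, rng)
        candidates.append([rng.gauss(x, s) for x, s in zip(parent, new_sigmas)])
    return candidates, new_sigmas
```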

Chapter 3

Auto-WEKA

To demonstrate the feasibility of an automatic approach to solving the CASH problem, we built a tool, Auto-WEKA, that solves this problem for all classification and regression algorithms, in combination with all feature selectors/evaluators, implemented in the standard WEKA package [Hall et al., 2009]. Table 3.1 lists all 39 WEKA learning algorithms. Of these, 27 are base algorithms (which can be used independently), 10 are meta-methods (which take a single base algorithm and its parameters as input), and the final 2 are ensemble algorithms that can take any number of base algorithms as input. We allowed the meta-methods to use any base algorithm with any hyperparameter settings, and allowed the 2 ensemble methods to use up to five of the 27 base algorithms, again with any hyperparameter settings. Auto-WEKA automatically determines which algorithms are applicable to each dataset, ensuring that regression algorithms are used when the prediction target is numeric and classification algorithms are used when it is categorical. Additionally, Auto-WEKA avoids the use of algorithms that are incompatible with a given dataset due to issues such as missing feature values.

Table 3.2 lists WEKA's three feature search methods and its eight feature evaluators, along with their respective numbers of hyperparameters (up to five for search methods and up to four for evaluators). To perform feature selection, a search method is combined with a feature evaluator, and the hyperparameters of both need to be instantiated. Feature selection is run as a preprocessing phase before the training of any learning algorithm begins.

The algorithms in Tables 3.1 and 3.2 have a wide variety of hyperparameters, which take values from continuous intervals, from ranges of integers, and from other discrete sets.

Table 3.1: Learning algorithms in Auto-WEKA. * indicates meta-methods, which in addition to their own parameters take one base algorithm and its parameters. + indicates ensemble methods that take as input up to 5 base algorithms and their parameters. We report the number of categorical (Cat.) and numeric (Num.) hyperparameters for each method.

Algorithm (Cat., Num.)                        Algorithm (Cat., Num.)
Bayes Net (2, 0)                              C4.5 Decision Tree (6, 2)
Naive Bayes (2, 0)                            Logistic Model Tree (5, 2)
Naive Bayes Multinomial (0, 0)                M5 Tree (3, 1)
Gaussian Process (3, 6)                       Random Forest (2, 3)
Linear Regression (2, 1)                      Random Tree (4, 4)
Logistic Regression (0, 1)                    REP Tree (2, 3)
Single-Layer Perceptron (5, 2)                Stochastic Gradient Descent (3, 2)
Locally Weighted Learning* (3, 0)             SVM (4, 6)
AdaBoostM1* (2, 2)                            Simple Linear Regression (0, 0)
Additive Regression* (1, 2)                   Simple Logistic Regression (2, 1)
Attribute Selected* (2, 0)                    Voted Perceptron (1, 2)
Bagging* (1, 2)                               KNN (4, 1)
Classification via Regression* (0, 0)         K-Star (2, 1)
LogitBoost* (4, 4)                            Decision Table (4, 0)
MultiClass Classifier* (3, 0)                 RIPPER (3, 1)
Random Committee* (0, 1)                      M5 Rules (3, 1)
Random Subspace* (0, 1)                       PART (2, 2)
Voting+ (0, 0)                                Decision Stump (0, 0)
Stacking+

We associated either a uniform or a log-uniform prior with each numeric parameter, depending on its semantics and a brief survey of values chosen in the literature. For example, we set a log-uniform prior for the ridge regression penalty, and a uniform prior for the maximum depth of a tree in a random forest. Auto-WEKA works with continuous hyperparameter values up to the precision of the machine it is run on; nevertheless, to give a sense of the size of the space we studied, we note that discretizing the hyperparameter domains to a maximum of 10 values each gives rise to over 10^47 hyperparameter settings. We emphasize that this space is much larger than a simple union of the base learners' hyperparameter spaces (whose size is roughly 10^8), since the ensemble methods allow up to 5 independent base learners, giving rise to a space with roughly (10^8)^5 = 10^40 elements. Feature selection gives rise to another independent decision between roughly 10^6 choices, and several parameters on the ensemble and meta level contribute another order of magnitude to the total size of Auto-WEKA's hyperparameter space. Auto-WEKA can be thought of as a single learning algorithm with a highly conditional hyperparameter space.

Table 3.2: Feature search/evaluator methods in Auto-WEKA. * indicates search methods requiring one feature evaluator that is used to determine the importance of a feature.

Feature Method (Categorical, Numeric)
Best First (1, 1)
Greedy Stepwise (3, 2)
Ranker (0, 1)
CFS Subset Eval (2, 0)
Pearson Correlation Eval (0, 0)
Gain Ratio Eval (0, 0)
Info Gain Eval
1R Eval (1, 2)
Principal Components Eval (2, 2)
RELIEF Eval (1, 2)
Symmetrical Uncertainty Eval (1, 0)

As depicted in Figure 3.1, Auto-WEKA has two top-level Boolean parameters. The first, is_base, selects between single base learning algorithms and ensemble or meta-algorithms. If is_base is true, then the parameter base determines which of the 27 base methods is to be used. If is_base is false, then learner indicates either an ensemble or a meta-algorithm. If learner is a meta-algorithm, then the parameter meta_base selects one of the 27 base algorithms. If learner is an ensemble algorithm, an additional parameter num_learners, an integer chosen from {1, ..., 5}, determines the number of base algorithms to be used; base_i variables are then selected according to the value of num_learners, each determining which of the 27 base algorithms to use. For each base parameter, the hyperparameters of all the base algorithms are attached and made conditional upon that parameter selecting the corresponding base algorithm.

Auto-WEKA's second top-level Boolean parameter, feat_sel, determines whether to apply one of the feature selection methods. If feat_sel is false, then Auto-WEKA passes the unmodified dataset to the learning algorithm. If it is true, then feat_ser selects the feature search method, and feat_eval selects the feature evaluator (each with conditional hyperparameters attached). This results in a very wide tree that captures the hierarchical nature of the hyperparameters and allows the creation of a single hyperparameter optimization problem with four hierarchical layers, consisting of a total of 786 parameters for classification problems and 472 parameters for regression problems; the difference arises because far fewer base algorithms in WEKA are able to make numeric predictions than can make categorical predictions.
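A compact sketch of the top-level structural choices just described (illustrative only: the lists of algorithms and feature methods are stubs, and the per-algorithm hyperparameter sub-spaces that Auto-WEKA attaches to each choice are omitted):

```python
import random

BASE_ALGORITHMS = ["NaiveBayes", "J48", "SMO"]          # stand-ins for the 27 base methods
META_ALGORITHMS = ["AdaBoostM1", "Bagging"]             # stand-ins for the 10 meta-methods
ENSEMBLE_ALGORITHMS = ["Voting", "Stacking"]            # the 2 ensemble methods
FEAT_SEARCH = ["BestFirst", "GreedyStepwise", "Ranker"]
FEAT_EVAL = ["CfsSubsetEval", "InfoGainEval", "ReliefEval"]

def sample_auto_weka_structure(rng):
    """Sample only the top-level (structural) choices of Auto-WEKA's space."""
    config = {"is_base": rng.random() < 0.5}
    if config["is_base"]:
        config["base"] = rng.choice(BASE_ALGORITHMS)
    else:
        config["learner"] = rng.choice(META_ALGORITHMS + ENSEMBLE_ALGORITHMS)
        if config["learner"] in META_ALGORITHMS:
            config["meta_base"] = rng.choice(BASE_ALGORITHMS)
        else:
            config["num_learners"] = rng.randint(1, 5)
            config["bases"] = [rng.choice(BASE_ALGORITHMS)
                               for _ in range(config["num_learners"])]
    config["feat_sel"] = rng.random() < 0.5
    if config["feat_sel"]:
        config["feat_ser"] = rng.choice(FEAT_SEARCH)     # feature search method
        config["feat_eval"] = rng.choice(FEAT_EVAL)      # feature evaluator
    return config
```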

Figure 3.1: Auto-WEKA's top-level parameters. Top: is_base controls Auto-WEKA's choice of either using a base algorithm or using a meta- or ensemble learner; the triangular items represent a parameter that selects one of the 27 base algorithms and its associated hyperparameters. Bottom: feat_sel controls Auto-WEKA's choice of feature selection methods.

Since Auto-WEKA is agnostic about the choice of optimizer, we implemented variants leveraging SMAC, TPE, and I/F-Race. SMAC, TPE and I/F-Race have their own parameters influencing performance, such as TPE's choice of the γ-quantile separating good from bad performance, the number of trees inside SMAC's random forest model, or I/F-Race's number of newly sampled candidates at each iteration. In Auto-WEKA, we used the defaults for these meta-hyperparameters, as set by their respective authors. Further improvements may be obtainable by optimizing these meta-hyperparameters, but a separate process with a meta-level training/validation split would be required to guard against over-fitting, and we did not attempt this due to the extreme computational cost of such experiments. All three model-based optimizers are randomized algorithms and thus produce different results depending on the random seed provided. As demonstrated in work by Hutter et al. [2012], this allows for trivial, yet effective parallelization of the optimization

process via simply performing k independent runs of the optimization method in parallel and selecting the result of the run with the lowest cross-validation error. Other, more sophisticated methods for the parallelization of Bayesian optimization exist [Hutter et al., 2012, Bergstra et al., 2011, Desautels et al., 2012, Snoek et al., 2012], but to date there is no empirical evidence that these methods outperform the simple approach we used here when the cost of evaluating hyperparameter configurations varies across the hyperparameter space. Our SMAC and TPE variants of Auto-WEKA use this simple parallelization approach, simulating runs on a standard quad-core desktop using 4 parallel jobs. The authors of I/F-Race, however, specifically designed their algorithm to run in parallel during the racing phase. As such, our I/F-Race variant of Auto-WEKA performs evaluations of candidates in parallel across 4 CPU cores.

Auto-WEKA also supports various resource constraints. When evaluating the performance of a learning algorithm on a pair of training and validation datasets, Auto-WEKA enforces both memory and time limits. If the learning algorithm requests more than a user-defined threshold of RAM, Auto-WEKA aborts its training (and treats the evaluation as a failure in the optimization method). Auto-WEKA also limits the time that can be used for training a learning algorithm on each pair of training and validation datasets, to ensure that the optimization technique has a chance to sufficiently explore the search space. The user sets a training budget in advance; once the budget has been consumed, Auto-WEKA sends an interrupt to the learning algorithm, asking it to finish training as soon as possible. In this case, the learning algorithm produces a (partially) trained model, which is then used to generate an error estimate on the validation data. Snoek et al. [2011] presented a promising approach for using runtime predictions in the expected improvement calculation to automatically drive the search away from excessively expensive models. While we did not implement such a technique, we see it as an interesting avenue to be explored in future work.
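The simple parallelization scheme just described can be pictured with a short sketch: k independent optimizer runs, one per random seed, executed in parallel, after which the run with the lowest cross-validation error is selected. This is an illustration under our own assumptions; names such as ParallelRunsSketch and optimizerRun are invented for the example and are not Auto-WEKA's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch (not Auto-WEKA's code) of the simple parallelization
// scheme: run k independent optimizer runs in parallel and keep the best one.
public class ParallelRunsSketch {

    /** Stand-in for one complete optimizer run; returns its best CV error. */
    static double optimizerRun(long seed) {
        // A real run would search the hyperparameter space within the user's
        // time and memory budgets; aborted evaluations inside the run are
        // reported to the optimizer as failures. Here we just fake an error.
        return new Random(seed).nextDouble();
    }

    public static void main(String[] args) throws InterruptedException {
        int k = 4;  // e.g. one job per core of a standard quad-core desktop
        ExecutorService pool = Executors.newFixedThreadPool(k);

        List<Future<Double>> runs = new ArrayList<>();
        for (long seed = 0; seed < k; seed++) {
            final long s = seed;
            Callable<Double> run = () -> optimizerRun(s);
            runs.add(pool.submit(run));
        }

        int bestSeed = -1;
        double bestError = Double.POSITIVE_INFINITY;
        for (int i = 0; i < runs.size(); i++) {
            try {
                double err = runs.get(i).get();
                if (err < bestError) {
                    bestError = err;
                    bestSeed = i;
                }
            } catch (ExecutionException e) {
                // A crashed run is simply ignored when picking the winner.
            }
        }
        pool.shutdown();
        System.out.println("selected seed " + bestSeed + ", CV error " + bestError);
    }
}
```

The per-evaluation limits described above (the RAM threshold and the training-time interrupt) would sit inside optimizerRun, around each individual model training, rather than around the whole run.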

In addition to supporting large-scale experiments on many datasets simultaneously, Auto-WEKA provides a user-friendly graphical interface. The interface operates in two modes, the first acting as a wizard (Figure 3.2). In wizard mode, a user specifies their dataset and the amount of computation time available. Auto-WEKA's experiment builder mode (Figure 3.3) presents additional parameter choices. The first screen accepts training and test data, and additionally specifies the method that Auto-WEKA will use to generate pairs of training and validation datasets. On the second screen, the user customizes the learning algorithms to be included in the search, possibly excluding algorithms that may be problematic for the dataset. The final screen sets the optimizer to use and specifies the user's resource constraints. Both modes then provide a way to perform and monitor the optimization process for different random seeds. After the optimization is complete, Auto-WEKA provides a summary of the performance of the selected algorithm and its hyperparameters, and allows the user to make predictions on new data (Figure 3.4). Like WEKA, we implemented Auto-WEKA in Java, and the software works on both UNIX-based and Windows machines. Auto-WEKA and its source code are available online, and we are committed to ensuring that Auto-WEKA remains available to new users.

Figure 3.2: Auto-WEKA's wizard interface.

Figure 3.3: Auto-WEKA's experiment builder workflow.

Figure 3.4: Auto-WEKA's interface for examining the best learning algorithm and hyperparameters after an experiment has been run.
