Auto-WEKA: Combined Selection and Hyperparameter Optimization of Supervised Machine Learning Algorithms


Auto-WEKA: Combined Selection and Hyperparameter Optimization of Supervised Machine Learning Algorithms

by

Chris Thornton

B.Sc., University of Calgary, 2011

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the Faculty of Graduate and Postdoctoral Studies (Computer Science)

The University of British Columbia (Vancouver)

March 2014

© Chris Thornton, 2014

Abstract

Many different machine learning algorithms exist; taking into account each algorithm's set of hyperparameters, there is a staggeringly large number of possible choices. This project considers the problem of simultaneously selecting a learning algorithm and setting its hyperparameters. Previous work attacks these issues separately, but this combined problem can be addressed by a fully automated approach, in particular by leveraging recent innovations in Bayesian optimization. The WEKA software package provides implementations of a number of feature selection and supervised machine learning algorithms, which we use inside our automated tool, Auto-WEKA. Specifically, we examined the 3 search and 8 evaluator methods for feature selection, as well as all of the classification and regression methods, spanning 2 ensemble methods, 10 meta-methods, 27 base algorithms, and their associated hyperparameters. On 34 popular datasets from the UCI repository, the Delve repository, the KDD Cup '09, variants of the MNIST dataset, and CIFAR-10, our method produces classification and regression performance often much better than that obtained using state-of-the-art algorithm selection and hyperparameter optimization methods from the literature. Using this integrated approach, users can more effectively identify not only the best machine learning algorithm, but also the corresponding hyperparameter settings and feature selection methods appropriate for that algorithm, and hence achieve improved performance for their specific classification or regression task.

Preface

This thesis is an expanded version of work that has been published as C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2013. I was involved in the conceptual design of Auto-WEKA, and was responsible for the development of Auto-WEKA's code. I performed all the experiments and analysis of results. In the remainder of this thesis, I adopt the first person plural in recognition of my collaborators.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
   1.1 Supervised machine learning problems
   1.2 Learning algorithm selection
       1.2.1 Previous approaches to learning algorithm selection
   1.3 Hyperparameter optimization
       1.3.1 Previous approaches to solving hyperparameter optimization
2 CASH and algorithms for solving it
   2.1 Baselines
   2.2 Model-based methods
       2.2.1 Sequential model-based algorithm configuration (SMAC)
       2.2.2 Tree-structured Parzen estimator (TPE)
       2.2.3 Iterated F-Race (I/F-Race)
3 Auto-WEKA
4 Evaluating Auto-WEKA
   4.1 Experimental setup
   4.2 Classification results
       The importance of solving CASH effectively
       Results for training performance
       Results for test performance
       Selected methods
   4.3 Regression results
       Results for training performance
       Results for test performance
       Selected methods
   4.4 Other modifications of SMAC-based Auto-WEKA
       Immediate evaluation of all folds
       Multi-level cross-validation
       Repeated random subsampling validation (RRSV)
       Longer runtimes
5 Conclusion and future work
Bibliography
A Method Comparison Results

List of Tables

Table 3.1  Learning algorithms in Auto-WEKA. * indicates meta-methods, which in addition to their own parameters take one base algorithm and its parameters. + indicates ensemble methods that take as input up to 5 base algorithms and their parameters. We report the number of categorical and numeric hyperparameters for each method.
Table 3.2  Feature search/evaluator methods in Auto-WEKA. * indicates search methods requiring one feature evaluator that is used to determine the importance of a feature.
Table 4.1  Classification datasets used. Num Categorical and Num Numeric refer to the number of categorical and numeric attributes of elements in the dataset, respectively.
Table 4.2  Oracle performance of Ex-Def and grid search.
Table 4.3  Training performance on classification datasets (Error %). Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.4  Test performance on classification datasets (Error %). Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.5  Correlation between the withheld 30% validation data and the training data performance. Gap indicates the difference between the mean training performance and mean test performance from Tables 4.3 and 4.4.
Table 4.6  Regression datasets used. Num Categorical and Num Numeric refer to the number of categorical and numeric attributes of elements in the dataset, respectively.
Table 4.7  Training performance on regression datasets (RMSE). Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.8  Test performance on regression datasets (RMSE). Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.9  Correlation between the withheld 30% validation data and the training data performance. Gap indicates the difference between the mean training performance and mean test performance from Tables 4.7 and 4.8.
Table 4.10 Comparisons of mean performance obtained between the SMAC and SMAC-10-Batch variants on classification datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.11 Comparisons of mean performance obtained between the SMAC and SMAC-10-Batch variants on regression datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.12 Comparisons of mean performance obtained between the SMAC and SMAC-Multi-Level variants on classification datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.13 Comparisons of mean performance obtained between the SMAC and SMAC-Multi-Level variants on regression datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.14 Comparisons of mean performance obtained between the SMAC and SMAC-RRSV variants on classification datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.15 Comparisons of mean performance obtained between the SMAC and SMAC-RRSV variants on regression datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.16 Comparisons of mean performance obtained between the SMAC and SMAC-Long variants on classification datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table 4.17 Comparisons of mean performance obtained between the SMAC and SMAC-Long variants on regression datasets. Bold entries denote performance statistically insignificant from the best, according to a Welch's t test.
Table A.1  Number of statistically significant wins on training performance of each method compared against another on classification datasets.
Table A.2  Number of statistically significant wins on test performance of each method compared against another on classification datasets.
Table A.3  Number of statistically significant wins on training performance of each method compared against another on regression datasets.
Table A.4  Number of statistically significant wins on test performance of each method compared against another on regression datasets.

List of Figures

Figure 3.1  Auto-WEKA's top-level parameters. Top: is_base controls Auto-WEKA's choice between using a base algorithm and using a meta or ensemble learner. The triangular items represent a parameter that selects one of the 27 base algorithms and its associated hyperparameters. Bottom: feat_sel controls Auto-WEKA's choice of feature selection methods.
Figure 3.2  Auto-WEKA's wizard interface.
Figure 3.3  Auto-WEKA's experiment builder workflow.
Figure 3.4  Auto-WEKA's interface for examining the best learning algorithm and hyperparameters after an experiment has been run.
Figure 4.1  Distribution of chosen classifiers aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants across all the small and large datasets, ranked on their frequency of being selected. Meta-methods are marked by a * suffix, ensemble methods by a + suffix.
Figure 4.2  Heat map of chosen classifiers aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants for each dataset. A darker colour indicates the method was selected more often. Meta-methods are marked by a * suffix, ensemble methods by a + suffix. Datasets are sorted by size, classifiers are ordered by methodology.
Figure 4.3  Left: distribution of chosen base classifiers for the two most frequently selected meta-methods: AdaBoostM1 and MultiClass Classifier. Right: distribution of chosen feature search and evaluator methods. Both plots are aggregated across all Auto-WEKA variants; None indicates that no feature selection was performed.
Figure 4.4  Heat map of chosen classifiers in all chosen meta-methods aggregated across the SMAC, I/F-Race, and TPE Auto-WEKA variants for each dataset. A darker colour indicates the method was selected more often. Datasets are sorted by size, classifiers are ordered by methodology.
Figure 4.5  Distribution of chosen regression algorithms aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants across all small and large datasets, ranked on their frequency of being selected. Meta-methods are marked by a * suffix, ensemble methods by a + suffix.
Figure 4.6  Heat map of chosen regression algorithms aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants for each dataset. A darker colour indicates that the method was selected more often. Meta-methods are marked by a * suffix, ensemble methods by a + suffix. Datasets are sorted by size, regression algorithms are ordered by methodology.
Figure 4.7  Left: distribution of chosen base regression algorithms for the two most frequently selected meta-methods: additive regression and bagging. Right: distribution of chosen feature search and evaluator methods. Both plots are aggregated across all Auto-WEKA variants; None indicates that no feature selection was performed.
Figure 4.8  Heat map of chosen regression algorithms in all chosen meta-methods aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants for each dataset. A darker colour indicates that the method was selected more often. Datasets are sorted by size, regression algorithms are ordered by methodology.
Figure 4.9  Graphical representation of the training data partitioning scheme used by SMAC-Multi-Level.
Figure 4.10 Trajectories of training and test performance over time for two small datasets. The vertical black line indicates the original 30 hour time budget. Shaded areas show the 10-90% quantile from the bootstrapped samples.
Figure 4.11 Trajectories of training and test performance over time for two large datasets. The vertical black line indicates the original 30 hour time budget. Shaded areas show the 10-90% quantile from the bootstrapped samples.

Acknowledgements

There are many people who helped make this work happen. First, I would like to thank my supervisors, Holger Hoos and Kevin Leyton-Brown, as well as close collaborator Frank Hutter, for their enduring guidance in working on this project. The members of both the β-lab and the GTDT reading group have been enormously supportive (both directly and indirectly): Alexandre Fréchette, Baharak Rastegari, Chris Fawcett, David Thompson, James Wright, Sam Bayless, Steve Ramage, and Zach Drudi. Finally, thanks to my friends and family for their unceasing encouragement.

Chapter 1

Introduction

An increasing variety of sophisticated feature selection and learning algorithms, complete with many hyperparameters, is available to a growing number of machine learning practitioners. These users require off-the-shelf solutions to their data analysis problems. The machine learning community has greatly aided such users by making available open source packages such as WEKA [Hall et al., 2009] and PyBrain [Schaul et al., 2010]. Such packages require a user to make two kinds of choices: first, to select a learning algorithm, and second, to customize it by setting hyperparameters, which may also control feature selection. It can be daunting to make the best choices when faced with so many degrees of freedom. Often a user may lack an in-depth understanding of the terminology and mechanics associated with each learning algorithm and its hyperparameter settings. This leads many users to select algorithms based on reputation or intuitive appeal, often leaving hyperparameters set to their default values. Adopting such a selection approach can yield poor performance. This suggests a natural challenge for machine learning: given a dataset, automatically and simultaneously choose a learning algorithm and set its hyperparameters to optimize empirical performance. We dub this problem the combined algorithm selection and hyperparameter optimization (CASH) problem. We provide a tool, Auto-WEKA, which requires minimal input from its user and provides a solution to CASH, searching over the learning algorithms provided in the standard WEKA distribution.

The CASH problem consists of two main subproblems: algorithm selection and hyperparameter optimization. The remainder of this chapter defines these subproblems and discusses previous work by the machine learning community to address them individually. In Chapter 2, we formally define the CASH problem, discussing the small amount of attention that variants of the CASH problem have received in the literature, as well as some possible methods for solving it.

Chapter 3 describes the design and mechanics of Auto-WEKA, our solution to an instance of the CASH problem. An in-depth empirical analysis of Auto-WEKA on 21 classification tasks and 13 regression tasks, comparing Auto-WEKA against standard baselines, is provided in Chapter 4. Conclusions and future work are discussed in Chapter 5.

1.1 Supervised machine learning problems

Our work focuses on supervised machine learning problems: learning a function f : X → Y, with X a set of features and Y either a finite set of different labels (for classification) or a subset of R (for regression). A supervised learning algorithm A maps a set {d_1, ..., d_n} of training data points d_i = (x_i, y_i) ∈ X × Y to such a function. The family of functions that A can produce is often called a model, while the output of A is often expressed via a vector of model parameters. The learned function can then be used on new data points x_j that were not contained in the training set, predicting the corresponding values ŷ_j. Most learning algorithms A further expose hyperparameters λ from a hyperparameter space Λ, which change the way the learning algorithm A_λ learns the desired function. Hyperparameters are used to indicate quantities such as a description-length penalty, the kernel width of a support vector machine, the number of neurons in a hidden layer of a neural network, or the number of data points that a leaf in a decision tree must contain to be eligible for splitting. In order to obtain a function that produces accurate predictions, learning algorithm selection and hyperparameter optimization need to be interwoven.

1.2 Learning algorithm selection

Learning algorithm selection, also called model selection, has been well studied by the machine learning community; a sample of this work is discussed in Section 1.2.1. Given a set of learning algorithms A and a set of training data D = {(x_1, y_1), ..., (x_n, y_n)}, the goal of model selection is to determine the algorithm A* ∈ A with the best generalization performance. Generalization performance is estimated by splitting D into (possibly many) disjoint training and validation sets D_train^(i) and D_valid^(i) for i = 1, ..., k, learning functions f_i by applying A* to D_train^(i), and evaluating the predictive performance of these functions on D_valid^(i).

This allows the learning algorithm selection problem to be written as:

\[
A^{*} \in \operatorname*{argmin}_{A \in \mathcal{A}} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\bigl(A,\ D_{\mathrm{train}}^{(i)},\ D_{\mathrm{valid}}^{(i)}\bigr), \tag{1.1}
\]

where L(A, D_train^(i), D_valid^(i)) is the loss achieved by A when trained on D_train^(i) and evaluated on D_valid^(i). For classification problems, the loss is typically defined as the rate at which the predictions differ from the labels of the validation data, whereas for regression problems the loss is often expressed as the root mean squared error (RMSE).

1.2.1 Previous approaches to learning algorithm selection

The simplest (and most general) approach to model selection is to first fix a set A of many different learning algorithms, compute an estimate of the loss function for each algorithm using the partitioned training data, and finally select the algorithm with the lowest estimated loss. One of the most common techniques for splitting the training data into pairs of training and validation sets is k-fold cross-validation, which splits the training data into k equal-sized partitions D_valid^(1), ..., D_valid^(k) and sets D_train^(i) = D \ D_valid^(i) for i = 1, ..., k. This is not the only way to partition the training data; Kohavi [1995] presents other techniques, such as repeated random subsampling validation.
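To make Equation (1.1) concrete, here is a minimal Python sketch of this exhaustive selection procedure (not part of Auto-WEKA; the candidate algorithms and the per-split `loss` function are placeholders supplied by the caller):

```python
import random

def k_fold_splits(data, k, seed=0):
    """Split data into k (train, valid) pairs: D_train^(i) = D \\ D_valid^(i)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [
        ([d for j, fold in enumerate(folds) if j != i for d in fold], folds[i])
        for i in range(k)
    ]

def select_algorithm(algorithms, data, loss, k=10):
    """Return the algorithm minimizing the mean k-fold cross-validation loss (Eq. 1.1)."""
    splits = k_fold_splits(data, k)

    def cv_loss(algo):
        return sum(loss(algo, train, valid) for train, valid in splits) / k

    return min(algorithms, key=cv_loss)
```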

This exhaustive approach suffers from the high computational cost of estimating the loss function of each algorithm, and from the more philosophical hurdle of deciding which algorithms should be included in the set A. Hoeffding races [Maron and Moore, 1994] address the first of these issues: the cost of selecting amongst a number of different algorithms. The main idea in racing algorithms for model selection is to determine which candidates (the models being compared) are highly probable to be inferior to the best candidates. Once inferior candidates have been identified, there is no need to expend further effort investigating their performance. In a Hoeffding race, a schedule over the pairs of training and validation data sets is chosen uniformly at random, determining the order in which the pairs will be used for estimating the loss of each candidate. The race consists of many rounds; at each round, the next pair of training and validation data is taken from the schedule and used to update the estimate of the loss function for each candidate. Hoeffding's bound [Hoeffding, 1963] is then used to produce an upper and a lower bound on the true value of the loss function for each candidate algorithm. Any candidate whose lower (i.e., best-case) bound lies above the best candidate's upper (i.e., worst-case) bound is eliminated. The race continues until only one candidate remains or all the pairs of training and validation data have been used to estimate the loss. Note that the race requires an initial burn-in period to gain a reliable estimate of the loss function before removing any candidates from the race. This implies that no candidate will be eliminated for some number of rounds at the beginning of the race. Additionally, because the data is used for multiple comparisons, techniques such as Bonferroni correction need to be used to avoid statistical errors [Maron and Moore, 1994].

Meta-learning is a discipline that uses machine learning to make predictions about a dataset as a whole, rather than about a particular element of the dataset [Bardenet et al., 2013, Leite et al., 2012, Pfahringer et al., 2000, Vilalta and Drissi, 2002]. One such meta-learning technique is landmarking. For each dataset in a repository of many datasets, a vector of dataset features is computed, such as the number of categorical or numeric attributes, the number of prediction labels (only for classification), or the size of the dataset. Additionally, the loss function of a number of different learning algorithms is evaluated on each dataset in the repository. A meta-learner is then trained on these pairs of dataset features and model performance, either predicting the best algorithm for a particular dataset or providing a ranking over algorithms that should be used on the dataset. Using the formalization of Section 1.1, the meta-learner operates on a dataset in which the x_i contain the features of datasets used in supervised machine learning tasks, and the corresponding y_i indicate the learning algorithm with the best performance. Landmarking suffers from the fact that even with an extensive repository of dataset features and method performances (which requires significant computational investment), it is likely that there will be subsequent machine learning problems proposed by the user for which the meta-learner makes inaccurate predictions. Such is the pitfall of exploratory research in any discipline. Note also that determining which learning algorithm to use for the meta-learner is itself another instance of model selection, so the algorithm chosen for the meta-learner can heavily influence which methods are selected.

Another consideration when performing model selection is the choice of loss function. There may be extra information inside the learning algorithm that provides a better indication of its generalization performance on new data. One such measure is Akaike's entropic information criterion [Bozdogan, 1987], known as AIC. AIC represents a compromise between the complexity of the learned function and the loss estimate, based on the idea that less complex functions are more likely to generalize to new data, consistent with the principle known as Occam's razor. Similar techniques, such as the Bayesian information criterion [Schwarz, 1978], provide alternative ways of balancing loss and model complexity.

1.3 Hyperparameter optimization

The problem of optimizing the hyperparameters λ ∈ Λ of a given learning algorithm A is conceptually similar to that of model selection. In both cases, the best-performing predictive model for a given dataset is desired, but instead of selecting from many different learning algorithms, the optimization considers a single algorithm's hyperparameters. The hyperparameters of a learning algorithm are often continuous, and their hyperparameter spaces are often high-dimensional. Additionally, it is possible to exploit the correlation between different hyperparameter settings λ_1, λ_2 ∈ Λ, a characteristic with no natural analogue in model selection. Given n hyperparameters λ_1, ..., λ_n with domains Λ_1, ..., Λ_n, the hyperparameter space Λ is a subset of the cross-product of these domains: Λ ⊆ Λ_1 × ... × Λ_n. This subset is often strict, such as when certain settings of one hyperparameter render other hyperparameters inactive. For example, the parameters determining the specifics of the third layer of a deep belief network are not relevant if the network depth is set to one or two. Likewise, the parameters of a support vector machine's polynomial kernel are not relevant if a radial basis function kernel is used instead.

More formally, following Hutter et al. [2009], we say that a hyperparameter λ_i is conditional on another hyperparameter λ_j if λ_i is only active when λ_j takes values from a given set V_i(j) ⊆ Λ_j; in this case, we call λ_j a parent of λ_i (and conversely, λ_i a child of λ_j). Conditional hyperparameters can in turn be parents of other conditional hyperparameters, giving rise to a tree-structured space [Bergstra et al., 2011] or, in some cases, a directed acyclic graph (DAG) [Hutter et al., 2009]. Given such a structured space Λ, the (hierarchical) hyperparameter optimization problem can be formalized as identifying

\[
\lambda^{*} \in \operatorname*{argmin}_{\lambda \in \Lambda} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\bigl(A_{\lambda},\ D_{\mathrm{train}}^{(i)},\ D_{\mathrm{valid}}^{(i)}\bigr).
\]

1.3.1 Previous approaches to solving hyperparameter optimization

Manual tuning of hyperparameter values has often been used in the past, since experienced users may have good intuition about which hyperparameters are likely to influence the performance of their learning algorithm most. By iteratively trying new hyperparameter settings, a user can home in on those that perform well.

However, this can be a time-consuming process and can nevertheless often result in suboptimal performance. The weaknesses of manual tuning are particularly apparent when the user's intuition is not valid for their specific problem.

Rather than relying on a user to guide the choice of hyperparameter values, grid search [Friedman et al., 2009] is one of the simplest automatic alternatives. Grid search requires that each hyperparameter λ_i in the hyperparameter space be treated discretely. Each numeric hyperparameter is discretized between some minimal and maximal value, while categorical hyperparameters remain unchanged. The set of grid points is then defined to be the Cartesian product of each of the now-discrete λ_i. At each of these grid points, the loss function is computed for all of the pairs (folds) of training and validation data, and the hyperparameter setting with the best performance over this grid is then used. Due to the combinatorial nature of grid search, this can be quite a computational burden if the discretization is fine or (particularly) if there are many hyperparameters. This can be partially addressed by starting with a very coarse discretization and then refining the upper and lower bounds of the hyperparameters to explore the area around the grid point with the best performance in the previous iteration [Van Gestel et al., 2004].

Grid search also suffers from the fact that often only a few hyperparameters are responsible for most of a learning algorithm's performance. To prevent a combinatorial explosion of grid points, each hyperparameter is discretized into a relatively small number of values. While the total number of different hyperparameter combinations examined over the course of a grid search is often quite high, each individual hyperparameter only has a few possible values tested. This is particularly problematic because the few hyperparameters responsible for a large portion of the performance variation receive the same amount of attention as the hyperparameters that barely affect performance. By instead sampling values for all hyperparameters at random, important hyperparameters take on many different values, resulting in a more effective search of the hyperparameter space. Using this random search, Bergstra and Bengio [2012] showed that with fewer resources, the performance of the selected hyperparameter values was better than that of both grid search and expert manual tuning. Like grid search, random search is also trivially parallelizable: by performing independent runs of the search with different random seeds on all available machines, it is easy to take advantage of large compute clusters or cloud computing to simultaneously examine many different hyperparameter values.
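The contrast between the two strategies is easiest to see in code. The sketch below uses a hypothetical search space for a single SVM-like learner (the ranges and `cv_loss` are placeholders, not part of the thesis): it draws configurations at random and keeps the best one.

```python
import math
import random

# Hypothetical search space: numeric ranges are sampled log-uniformly,
# categorical choices uniformly.
SPACE = {
    "C":         ("log-uniform", 1e-3, 1e3),
    "gamma":     ("log-uniform", 1e-4, 1e1),
    "shrinking": ("choice", [True, False]),
}

def sample_configuration(space, rng):
    config = {}
    for name, spec in space.items():
        if spec[0] == "log-uniform":
            lo, hi = spec[1], spec[2]
            config[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        else:  # categorical choice
            config[name] = rng.choice(spec[1])
    return config

def random_search(cv_loss, space, budget=100, seed=0):
    """Evaluate `budget` random configurations; return the best (loss, config)."""
    rng = random.Random(seed)
    best = (float("inf"), None)
    for _ in range(budget):
        config = sample_configuration(space, rng)
        best = min(best, (cv_loss(config), config), key=lambda t: t[0])
    return best
```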

Evolutionary techniques have also been successfully applied to hyperparameter optimization, such as in the work of Guo et al. [2008], where a particle swarm optimizer tuned the hyperparameters of a support vector machine. In the work of Jin and Sendhoff [2008], evolutionary algorithms for multiobjective optimization were applied to set the hyperparameters and the complexity of the learned model. These techniques are promising, since they make few assumptions about the underlying optimization problem and are able to handle scenarios with many parameters, such as the work of Guo et al. [2008], which optimized 15 hyperparameters.

If all hyperparameters are numeric and the performance of the learning algorithm is well-behaved with respect to the hyperparameters, gradient-based techniques can be used [Bengio, 2000]. The gradient information can be computed directly or approximated empirically. One of the most popular of these techniques is stochastic gradient descent (SGD; Bottou [1998]). SGD is especially appealing for cases with large amounts of data, since partial gradient information can be computed using mini-batches of the data, making it possible to optimize performance for datasets that cannot be loaded into memory. Like all gradient-based techniques, if the learning algorithm's loss function is convex in the hyperparameters, SGD will not become trapped in a local minimum and thus yields optimal hyperparameter settings.

Recently, techniques from Bayesian optimization have been used to search over hyperparameters: Snoek et al. [2012] used Gaussian processes and Bergstra et al. [2011] used a tree of Parzen estimators to find good hyperparameter settings. These methods have been shown to perform better than either grid or random search; in particular, Bergstra et al. [2011] were able to find hyperparameter settings for a deep belief network that surpassed the state of the art on a variant of the MNIST character recognition dataset.

There also exist various techniques that optimize hyperparameters for a specific family of learning algorithms. For example, Strijov and Weber [2010] used coherent Bayesian inference to adjust the coefficients in their parametric regression procedure. The drawback of such targeted optimization approaches is that they rely heavily on the specifics of the algorithm they are optimizing, making them difficult to transfer to other learning algorithms.

Chapter 2

CASH and algorithms for solving it

The combined algorithm selection and hyperparameter optimization (CASH) problem formally defines the challenge of simultaneously selecting a machine learning algorithm and choosing the associated hyperparameter values for that algorithm. Solutions to this problem have large practical importance to the machine learning community, as users seek to leverage state-of-the-art algorithms for their research. Given a set of algorithms A = {A^(1), ..., A^(k)} with associated hyperparameter spaces Λ^(1), ..., Λ^(k), and disjoint pairs of training and validation data D_train^(i) and D_valid^(i), the goal in solving the CASH problem is to find:

\[
A^{*}_{\lambda^{*}} \in \operatorname*{argmin}_{A^{(j)} \in \mathcal{A},\ \lambda \in \Lambda^{(j)}} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\bigl(A^{(j)}_{\lambda},\ D_{\mathrm{train}}^{(i)},\ D_{\mathrm{valid}}^{(i)}\bigr). \tag{2.1}
\]

We note that this problem can be reformulated as a single combined hierarchical hyperparameter optimization problem with parameter space Λ = Λ^(1) ∪ ... ∪ Λ^(k) ∪ {λ_r}, where λ_r ∈ {A^(1), ..., A^(k)} is a new root-level hyperparameter that selects between the algorithms A^(1), ..., A^(k). The root-level parameters of each subspace Λ^(j) are made conditional on λ_r being instantiated to A^(j).
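The reformulation can be pictured as a sampler over the combined space, in which the root-level choice λ_r activates exactly one algorithm's subspace. The following is a toy two-algorithm sketch, not Auto-WEKA's actual encoding:

```python
import math
import random

# lambda_r is the root-level hyperparameter choosing the algorithm; each
# algorithm's subspace is only sampled (active) when lambda_r selects it.
CASH_SPACE = {
    "lambda_r": ["svm", "random_forest"],
    "svm":           {"C": ("log-uniform", 1e-3, 1e3),
                      "gamma": ("log-uniform", 1e-4, 1e1)},
    "random_forest": {"num_trees": ("int-uniform", 10, 500),
                      "max_depth": ("int-uniform", 2, 20)},
}

def sample_cash_configuration(space, rng):
    algo = rng.choice(space["lambda_r"])               # instantiate lambda_r
    config = {"lambda_r": algo}
    for name, (kind, lo, hi) in space[algo].items():   # only the active subspace
        if kind == "log-uniform":
            config[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        else:  # "int-uniform"
            config[name] = rng.randint(lo, hi)
    return config
```

Randomly sampling such configurations and keeping the one with the lowest cross-validation loss is precisely the random-search baseline described in Section 2.1.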

Given the extensive literature on model selection and hyperparameter optimization, and in light of the problem's practical importance, we were surprised to find that only limited variants of the CASH problem have been studied. Furthermore, each of these variants is applicable only to a fixed and relatively small number of parameter configurations for each algorithm. For example, in the meta-learning-based work of Leite et al. [2012], a total of 292 algorithm-hyperparameter combinations were considered, spanning six different learning algorithms, while Sun and Pfahringer [2013] present another meta-learning approach that considers twenty learning algorithms over 466 datasets. Admittedly, it is very challenging to search the combined space of learning algorithms and their hyperparameters: the space is high-dimensional, involving both categorical and continuous choices, and the response function is noisy due to the limited quantity of validation data. Furthermore, the search space contains hierarchical dependencies; for example, the hyperparameters of a learning algorithm are only meaningful if that algorithm is chosen, and the base algorithm choices in an ensemble method are only meaningful if that particular ensemble method is chosen.

The remainder of this chapter describes a number of possible procedures for solving CASH, adapting existing selection and optimization strategies from the literature. The first three methods, described in Section 2.1, are either simple approaches or already in wide use by the machine learning community, while the last three methods, detailed in Section 2.2, all employ more complex optimization strategies.

2.1 Baselines

In principle, a solution to the CASH problem may be identified in a variety of ways. Our Exhaustive-Default (Ex-Def) technique was implemented as a rudimentary approach requiring minimal computational resources. To use Ex-Def, the user obtains implementations of a number of different learning algorithms that are applicable to their specific learning task and dataset. Ex-Def then computes the standard k-fold cross-validation error for each learning algorithm, leaving hyperparameters at the default values set by the implementers of each learning algorithm. After these computations are complete, Ex-Def selects the learning algorithm with the best performance to be used on the dataset. Note that this simple selection technique is unlikely to produce optimal performance, since it does not tune hyperparameters beyond the defaults for the particulars of the given dataset.

Users with more computational resources at their disposal may employ a grid search technique, where the grid is the union of the distinct sub-grids for each of the available learning algorithms. While grid search can require an extensive CPU time budget for optimizing the hyperparameters of even a single learning algorithm, this cost only increases linearly with the number of learning algorithms considered. Setting up such a grid search can also be labour-intensive, even using readily available research tools, such as those found in the open source machine learning package WEKA.
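A sketch of the union-of-sub-grids construction (hypothetical discretizations; `cv_loss` is again a placeholder): each algorithm contributes the Cartesian product of its own discretized hyperparameters, and the best point across all sub-grids is returned.

```python
from itertools import product

# Hypothetical discretized sub-grids, one per learning algorithm.
SUB_GRIDS = {
    "svm":           {"C": [0.01, 1.0, 100.0], "gamma": [0.001, 0.1, 10.0]},
    "random_forest": {"num_trees": [10, 100, 500], "max_depth": [5, 10, 20]},
}

def grid_points(sub_grids):
    """Yield (algorithm, config) pairs from the union of per-algorithm sub-grids."""
    for algo, grid in sub_grids.items():
        names = sorted(grid)
        for values in product(*(grid[n] for n in names)):
            yield algo, dict(zip(names, values))

def grid_search(cv_loss, sub_grids):
    """Return (loss, algorithm, config) of the best grid point across all sub-grids."""
    return min(((cv_loss(algo, cfg), algo, cfg) for algo, cfg in grid_points(sub_grids)),
               key=lambda t: t[0])
```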

WEKA provides two implementations of grid search for tuning the hyperparameters of a single learning algorithm; the first can optimize any number of top-level hyperparameters, while the second can optimize any two hyperparameters, including nested ones. However, the user has to define the minimal and maximal values for each numeric hyperparameter. In order to perform a grid search that solves CASH, the user would have to prepare a number of different grid search experiments using these tools and then select amongst the best models from each of the smaller grid searches.

Random search alleviates some of the drawbacks of grid search and may be applied to CASH in a straightforward way. Samples for the random search are created by simply selecting a learning algorithm at random, then randomly sampling values for each of the hyperparameters (and children of the active hyperparameters) associated with the chosen algorithm. As described in Section 1.3.1, random search offers several advantages over grid search.

2.2 Model-based methods

A promising approach to solving CASH is model-based optimization [Zlochin et al., 2004]. This approach builds a predictive model of the underlying optimization problem and uses this model to guide the optimization process. In particular, the Bayesian approach of Sequential Model-Based Optimization (SMBO) [Hutter et al., 2011], a versatile stochastic optimization framework that can work explicitly with both categorical and continuous hyperparameters, has the ability to exploit the hierarchical structure stemming from the conditional parameters that are prevalent in CASH. As outlined in Algorithm 1, SMBO first builds a model M_L that captures the dependence of the loss function L on hyperparameter settings λ (line 1 of Algorithm 1). It then iterates the following steps: use M_L to determine a promising candidate configuration of hyperparameters λ to evaluate next (line 3), evaluate the loss c of λ (line 4), and update the model M_L with the new data point (λ, c) obtained (lines 5-6).

In order to select the next hyperparameter configuration λ using model M_L, SMBO uses a so-called acquisition function a_{M_L} : Λ → R, which uses the predictive distribution of model M_L at arbitrary hyperparameter configurations λ ∈ Λ to quantify (in closed form) how useful knowledge about λ would be. SMBO then simply maximizes this function over Λ to select the most promising configuration λ to evaluate next. Several well-studied acquisition functions exist [Jones et al., 1998, Schonlau et al., 1998, Srinivas et al., 2010]; all aim to automatically trade off exploitation (locally optimizing hyperparameters in regions known to contain good settings) against exploration (trying hyperparameter settings in relatively unexplored regions). In this work, we maximized the positive expected improvement (EI) attainable over an existing loss value c_min [Schonlau et al., 1998]; the EI is high for hyperparameter configurations with high uncertainty and good predicted performance under the model. Let c(λ) denote the loss achieved by hyperparameter configuration λ. Then the positive improvement function over c_min is defined as I_{c_min}(λ) := max{c_min − c(λ), 0}. Of course, we do not know c(λ). We can, however, compute its expectation with respect to the current model M_L:

\[
\mathbb{E}_{\mathcal{M}_L}\bigl[I_{c_{\min}}(\lambda)\bigr] \;=\; \int_{-\infty}^{c_{\min}} \max\{c_{\min} - c,\ 0\}\; p_{\mathcal{M}_L}(c \mid \lambda)\, \mathrm{d}c. \tag{2.2}
\]

Algorithm 1: SMBO
  Input: algorithm A with hyperparameter space Λ; k pairs of D_train^(i), D_valid^(i); time budget for optimization
  Output: λ ∈ Λ with best performance
  1: initialise model M_L; H ← ∅
  2: while the time budget for optimization has not been exhausted do
  3:     [λ, i] ← candidate configuration and dataset-pair index from M_L
  4:     compute c = L(A_λ, D_train^(i), D_valid^(i))
  5:     H ← H ∪ {(λ, c, i)}
  6:     update M_L based on H
  7: end while
  8: return the λ from H with minimal c

While SMBO algorithms are well suited to solving CASH, other model-based techniques are also applicable. We now review two SMBO algorithms and one more general model-based optimization algorithm that are capable of handling the hierarchical hyperparameters prevalent in CASH. The first algorithm has predominantly been used for algorithm configuration, while the last two have been used before to perform hyperparameter optimization. To our knowledge, these algorithms have not previously been used to consider many different learning algorithms simultaneously.
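For readers who prefer code to pseudocode, the following is a minimal sketch of the SMBO loop of Algorithm 1. The surrogate model, the configuration sampler, and the loss evaluation are placeholders, and candidates are proposed by maximizing the acquisition function over a random pool rather than by SMAC's local search:

```python
import random

def smbo(loss, sample_configuration, surrogate, time_left, expected_improvement,
         n_candidates=1000, seed=0):
    """Generic SMBO loop (cf. Algorithm 1): propose by acquisition, evaluate, refit."""
    rng = random.Random(seed)
    history = []                                   # H = {(lambda, c)}
    incumbent, c_min = None, float("inf")
    while time_left():
        if not history:
            candidate = sample_configuration(rng)  # no model yet: sample at random
        else:
            surrogate.fit(history)                 # M_L built from all observations
            pool = [sample_configuration(rng) for _ in range(n_candidates)]
            candidate = max(pool,
                            key=lambda lam: expected_improvement(surrogate, lam, c_min))
        c = loss(candidate)                        # one cross-validation evaluation
        history.append((candidate, c))
        if c < c_min:
            incumbent, c_min = candidate, c
    return incumbent
```

Here `expected_improvement` would use the surrogate's predictive mean and variance, as shown for SMAC's Gaussian case in the next subsection.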

2.2.1 Sequential model-based algorithm configuration (SMAC)

Sequential model-based algorithm configuration [SMAC; Hutter et al., 2011] has predominantly been used for the task of algorithm configuration: determining the parameters of solvers for (often hard) computational problems in order to produce either higher-quality solutions or faster run times for tasks such as Boolean satisfiability and mixed integer programming. CASH is conceptually similar to algorithm configuration, since parameter settings for industry-standard solvers are often a mix of categorical and numeric parameters, and may include conditional parameters.

SMAC supports a variety of models p(c | λ) to capture the dependence of the loss function c on hyperparameters λ, including approximate Gaussian processes and random forests. In this thesis we used random forest models, since they tend to perform well with discrete and high-dimensional input data. SMAC handles conditional parameters by instantiating inactive conditional parameters in λ to default values for model training and prediction. This allows individual decision trees to include splits of the kind "is hyperparameter λ_i active?", allowing them to focus on active hyperparameters. SMAC obtains a predictive mean µ_λ and variance σ²_λ of p(c | λ) as frequentist estimates over the predictions of its individual trees for λ; it then models p_{M_L}(c | λ) as a Gaussian N(µ_λ, σ²_λ). SMAC uses the expected improvement criterion defined in Equation 2.2, instantiating c_min to the error rate of the best hyperparameter configuration measured so far. Under SMAC's predictive distribution p_{M_L}(c | λ) = N(µ_λ, σ²_λ), this expectation can be expressed in closed form as

\[
\mathbb{E}_{\mathcal{M}_L}\bigl[I_{c_{\min}}(\lambda)\bigr] \;=\; \sigma_{\lambda}\,\bigl[u\,\Phi(u) + \varphi(u)\bigr], \qquad \text{where } u = \frac{c_{\min} - \mu_{\lambda}}{\sigma_{\lambda}},
\]

and ϕ and Φ denote the probability density function and cumulative distribution function of a standard normal distribution, respectively [Jones et al., 1998].

A multi-start local search procedure is used to select the next hyperparameter configurations to evaluate, using the ten hyperparameter configurations already considered by SMAC with the largest EI as starting points. The local search greedily considers a set of neighbouring hyperparameter settings, where neighbours differ in one hyperparameter value, and terminates when there are no neighbours with a higher EI. Additional randomly sampled hyperparameter configurations are also considered among the possible configurations to evaluate next. The EI of this combined set of hyperparameter configurations is then computed from the predictive model, and the configuration with the largest EI is selected. Note that this local search process is computationally cheap, since it only queries the predictive model, and it can be further optimized since many of the predictions are for points that lie relatively near each other in the hyperparameter space.
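The closed-form EI is simple to compute once a surrogate supplies a predictive mean and variance. A minimal sketch follows; the per-tree aggregation mirrors SMAC's frequentist estimates, but this is otherwise not SMAC's implementation:

```python
import math

def standard_normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def standard_normal_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def expected_improvement(mu, sigma, c_min):
    """E[I_{c_min}] = sigma * (u * Phi(u) + phi(u)), with u = (c_min - mu) / sigma."""
    if sigma <= 0.0:
        return max(c_min - mu, 0.0)
    u = (c_min - mu) / sigma
    return sigma * (u * standard_normal_cdf(u) + standard_normal_pdf(u))

def forest_mean_and_std(per_tree_predictions):
    """Frequentist mean/std of the predicted loss over individual tree predictions."""
    n = len(per_tree_predictions)
    mu = sum(per_tree_predictions) / n
    var = sum((p - mu) ** 2 for p in per_tree_predictions) / n
    return mu, math.sqrt(var)
```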

SMAC was designed for robust optimization under noisy function evaluations, and as such implements special mechanisms to keep track of its best known configuration and to assure high confidence in its estimate of that configuration's performance. This robustness against noisy function evaluations can be leveraged in combined algorithm selection and hyperparameter optimization, since the function to be optimized in Equation (1.1) is a mean over a set of loss terms (each corresponding to one pair D_train^(i) and D_valid^(i) constructed from the training set). A key idea in SMAC is to make progressively better estimates of this mean by evaluating the loss terms one at a time, thereby trading off accuracy against computational cost. In order for a new configuration to become the new incumbent (the current best configuration found so far), it must outperform the previous incumbent in every comparison made: considering only one fold, two folds, and so on, up to the total number of folds previously used to evaluate the incumbent. Furthermore, every time the incumbent survives such a comparison, it is evaluated on a new fold, up to the total number available, meaning that the number of folds used to evaluate the incumbent grows over time. This also allows a poorly performing configuration to be removed from consideration after evaluating it on a single fold.

Finally, SMAC implements a diversification mechanism to achieve robust performance even when its model is misled, and to explore new parts of the space: every other configuration is selected uniformly at random. These randomly selected points improve the accuracy of the model and do not significantly hamper SMAC's progress if it has already found a high-quality region of the search space; because of the evaluation procedure just described, this requires less overhead than one might imagine.

2.2.2 Tree-structured Parzen estimator (TPE)

The Tree-structured Parzen Estimator [TPE; Bergstra et al., 2011] is an optimization technique specifically designed for hyperparameter optimization. While SMAC models p(c | λ) explicitly, TPE uses separate models for p(c) and p(λ | c). Specifically, it models p(λ | c) as one of two density estimates, conditional on whether c is greater or less than a given threshold value c*:

\[
p(\lambda \mid c) \;=\;
\begin{cases}
\ell(\lambda), & \text{if } c < c^{*}, \\
g(\lambda), & \text{if } c \geq c^{*}.
\end{cases}
\]

Here, c* is chosen as the γ-quantile of the losses TPE has obtained so far (where γ is an algorithm parameter with a default value of γ = 0.15), ℓ(·) is a density estimate learned from all previous hyperparameter settings λ with corresponding loss smaller than c*, and g(·) is a density estimate learned from all previous hyperparameter settings λ with corresponding loss greater than or equal to c*. Intuitively, this creates a probabilistic density estimator ℓ(·) for hyperparameter settings that appear to do well, and a different density estimator g(·) for hyperparameter settings that appear to do poorly with respect to the threshold. Bergstra et al. [2011] showed that the expected improvement E_{M_L}[I_{c_min}(λ)] from Equation 2.2 is proportional to

\[
\Bigl(\gamma + \frac{g(\lambda)}{\ell(\lambda)}\,(1-\gamma)\Bigr)^{-1}.
\]

TPE maximizes this expression by generating many candidate hyperparameter configurations at random from ℓ(·) and picking a λ that minimizes g(λ)/ℓ(λ).

The density estimators ℓ(·) and g(·) have a hierarchical structure with continuous, discrete, and conditional variables reflecting the hyperparameters and their dependence relationships. For each node in this tree structure, a one-dimensional Parzen estimator is created to model the probability density of the node's corresponding hyperparameter. For a given hyperparameter configuration λ that is added to either ℓ or g, only the one-dimensional estimators corresponding to active hyperparameters in λ are updated. For continuous hyperparameters, these estimators are constructed by placing density in the form of a Gaussian at each observed hyperparameter value λ_i, with standard deviation set to the greater of the distances to the value's left and right neighbours. Discrete hyperparameters are estimated with probabilities proportional to the number of times that each particular choice occurred in the set of observations. To evaluate a candidate configuration λ's probability estimate, TPE starts at the root of the tree and descends into the leaves by following paths that only use active hyperparameters. At each node in this traversal, the probability of the corresponding hyperparameter value is computed according to its one-dimensional estimator, and the individual probabilities are combined on a pass back up to the root of the tree. Note that this means TPE assumes independence between hyperparameters that do not appear together along any path from the tree's root to one of its leaves. This assumption can be problematic, since it does not account for cases in which interactions between sibling hyperparameters are responsible for performance differences.
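The selection rule can be sketched for a single continuous hyperparameter as follows. This is a toy illustration rather than the implementation of Bergstra et al. [2011]: it uses the neighbour-based bandwidth rule described above and approximates sampling from ℓ(·) by perturbing the good observations.

```python
import math
import random

def parzen_pdf(x, observations):
    """1D Parzen estimator: a Gaussian at each observation, with bandwidth equal to
    the larger distance to its left/right neighbour (fixed width for a single point)."""
    obs = sorted(observations)
    total = 0.0
    for i, mu in enumerate(obs):
        left = obs[i] - obs[i - 1] if i > 0 else None
        right = obs[i + 1] - obs[i] if i < len(obs) - 1 else None
        sigma = max(d for d in (left, right) if d is not None) if len(obs) > 1 else 1.0
        sigma = max(sigma, 1e-12)
        total += math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return total / len(obs)

def tpe_propose(history, gamma=0.15, n_candidates=100, seed=0):
    """history: list of (lambda_value, loss). The best gamma fraction forms the 'good'
    estimator l, the rest the 'bad' estimator g; candidates are scored by g/l."""
    rng = random.Random(seed)
    ranked = sorted(history, key=lambda t: t[1])
    n_good = max(1, int(gamma * len(ranked)))
    good = [lam for lam, _ in ranked[:n_good]]
    bad = [lam for lam, _ in ranked[n_good:]] or good
    candidates = [rng.choice(good) + rng.gauss(0.0, 0.1) for _ in range(n_candidates)]
    return min(candidates,
               key=lambda x: parzen_pdf(x, bad) / max(parzen_pdf(x, good), 1e-12))
```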

2.2.3 Iterated F-Race (I/F-Race)

Iterated F-Race [I/F-Race; Balaprakash et al., 2007] belongs to the more general family of model-based optimization algorithms and, as the name suggests, uses a racing procedure at its core. Like SMAC, I/F-Race has primarily been used for algorithm configuration tasks, such as configuring a solver for scheduling problems [Dubois-Lacoste et al., 2011]. Candidates for the race are sampled randomly, and conditional hyperparameters are supported by sampling child hyperparameters only when their parent hyperparameter is active. I/F-Race can be used to solve CASH by treating the choice of learning algorithm as a root-level hyperparameter.

Recall that Hoeffding races use Hoeffding's bound to assess the likely performance of a racing candidate, and that this bound can often be quite loose. F-Race [Birattari et al., 2002] replaces the bound with the non-parametric Friedman test [Conover, 1998] to find inferior candidates. This test considers the ranks of all the candidates on each pair of training and validation data used so far in the race, and indicates whether some candidates tend to yield better performance than at least one other. As soon as the Friedman test detects the presence of such a difference, pairwise test statistics are computed between the candidates to eliminate those with poor performance. Unlike Hoeffding races, F-Race does not use any form of multiple-testing correction when comparing candidates. Note that F-Race is unable to select different learning algorithms or new values for hyperparameters once the race has begun, so the initial number of racing candidates must be quite large in order to ensure high performance. The initial candidates can be generated, for example, either from all the points of a grid search or through random sampling. Since racing algorithms require a few iterations before they can begin to eliminate candidates, a large portion of the computational resources will still be spent investigating algorithms and hyperparameter settings that are not even close to optimal.

I/F-Race addresses this problem by performing many rounds of a modified F-Race procedure on a more manageable number of candidates, each time randomly sampling new candidates from the space of learning algorithms and hyperparameters. The modifications to the standard F-Race procedure concern the termination conditions: a race is terminated if the number of surviving candidates drops below a fixed threshold, if the race has used at least some number of folds of the dataset, or if some computational budget has been exhausted. These thresholds are all set adaptively based on the specifics of the problem I/F-Race is optimizing. As soon as a (fixed) small number of candidates remain, the round is terminated, and the sampling distributions are updated to be more concentrated around the algorithms and hyperparameter values that appear to provide good performance.

More specifically, in the first round of I/F-Race, all the algorithms and their hyperparameters are sampled uniformly at random. Once a round of F-Race terminates, the surviving candidates are ranked by their performance. To generate new candidates for the next round of the race, I/F-Race first samples from the survivors of the previous round inversely proportionally to their rank (candidates with high performance are more likely to be sampled). A new candidate λ'_s = (λ'_1, ..., λ'_d) is then generated from the sampled survivor λ_s = (λ_1, ..., λ_d) by setting λ'_i ~ N(λ_i, σ'_i), where:

\[
\sigma'_i \;=\; \sigma_i \left(\frac{1}{N_{\max}}\right)^{1/d}.
\]

In this equation, N_max is the initial number of candidates used at the beginning of an iteration of I/F-Race. This approach was designed to reduce the volume of the sampled hyperparameter space at a constant rate each iteration, so that candidates generated in subsequent iterations are concentrated around hyperparameter values that were successful in previous iterations. When I/F-Race finishes its final round of racing, it is possible that several candidates remain without sufficient evidence to indicate which is best. In this case, I/F-Race selects the candidate with the best performance measured over the pairs (folds) of training and validation data used. Like TPE, I/F-Race also assumes independence between hyperparameters (it therefore cannot capture interactions between sibling hyperparameters in its model), and it only samples child hyperparameters when their parents are active.
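A sketch of this candidate-generation step (illustrative only; it assumes purely numeric configurations, whereas the real I/F-Race also handles categorical and conditional hyperparameters):

```python
import random

def sample_survivor(survivors, rng):
    """Pick a ranked survivor with probability inversely proportional to its rank."""
    weights = [1.0 / rank for rank in range(1, len(survivors) + 1)]
    return rng.choices(survivors, weights=weights, k=1)[0]

def generate_candidates(survivors, sigmas, n_max, n_new, seed=0):
    """survivors: numeric configurations ranked best-first; sigmas: per-dimension
    standard deviations. New candidates are Gaussian perturbations with shrunk sigma."""
    rng = random.Random(seed)
    d = len(sigmas)
    shrink = (1.0 / n_max) ** (1.0 / d)          # sigma'_i = sigma_i * (1/N_max)^(1/d)
    new_sigmas = [s * shrink for s in sigmas]
    candidates = []
    for _ in range(n_new):
        parent = sample_survivor(survivors, rng)
        candidates.append([rng.gauss(x, s) for x, s in zip(parent, new_sigmas)])
    return candidates, new_sigmas
```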

Chapter 3

Auto-WEKA

To demonstrate the feasibility of an automatic approach to solving the CASH problem, we built a tool, Auto-WEKA, that solves this problem for all classification and regression algorithms, in combination with all feature selectors/evaluators, implemented in the standard WEKA package [Hall et al., 2009]. Table 3.1 lists all 39 WEKA learning algorithms. Of these, 27 are base algorithms (which can be used independently), 10 are meta-methods (which take a single base algorithm and its parameters as input), and the final 2 are ensemble algorithms that can take any number of base algorithms as input. We allowed the meta-methods to use any base algorithm with any hyperparameter settings, and allowed the 2 ensemble methods to use up to five of the 27 base algorithms, again with any hyperparameter settings. Auto-WEKA automatically determines which algorithms are applicable to each dataset, ensuring that regression algorithms are used when the prediction target is numeric and classification algorithms are used when it is categorical. Additionally, Auto-WEKA avoids the use of algorithms that are incompatible with a given dataset due to issues such as missing feature values.

Table 3.2 lists WEKA's three feature search methods and its eight feature evaluators, along with their respective numbers of hyperparameters (up to five for search methods and up to four for evaluators). To perform feature selection, a search method is combined with a feature evaluator, and the hyperparameters of both need to be instantiated. Feature selection is run as a preprocessing phase before the training of any learning algorithm begins.

The algorithms in Tables 3.1 and 3.2 have a wide variety of hyperparameters, which take values from continuous intervals, from ranges of integers, and from other discrete sets.

Table 3.1: Learning algorithms in Auto-WEKA. * indicates meta-methods, which in addition to their own parameters take one base algorithm and its parameters. + indicates ensemble methods that take as input up to 5 base algorithms and their parameters. We report the number of categorical (Cat.) and numeric (Num.) hyperparameters for each method.

Algorithm (Cat., Num.)                        Algorithm (Cat., Num.)
Bayes Net (2, 0)                              C4.5 Decision Tree (6, 2)
Naive Bayes (2, 0)                            Logistic Model Tree (5, 2)
Naive Bayes Multinomial (0, 0)                M5 Tree (3, 1)
Gaussian Process (3, 6)                       Random Forest (2, 3)
Linear Regression (2, 1)                      Random Tree (4, 4)
Logistic Regression (0, 1)                    REP Tree (2, 3)
Single-Layer Perceptron (5, 2)                Stochastic Gradient Descent (3, 2)
Locally Weighted Learning* (3, 0)             SVM (4, 6)
AdaBoostM1* (2, 2)                            Simple Linear Regression (0, 0)
Additive Regression* (1, 2)                   Simple Logistic Regression (2, 1)
Attribute Selected* (2, 0)                    Voted Perceptron (1, 2)
Bagging* (1, 2)                               KNN (4, 1)
Classification via Regression* (0, 0)         K-Star (2, 1)
LogitBoost* (4, 4)                            Decision Table (4, 0)
MultiClass Classifier* (3, 0)                 RIPPER (3, 1)
Random Committee* (0, 1)                      M5 Rules (3, 1)
Random Subspace* (0, 1)                       PART (2, 2)
Voting+ (0, 0)                                Decision Stump (0, 0)
Stacking+

We associated either a uniform or a log-uniform prior with each numeric parameter, depending on its semantics and a brief survey of values chosen in the literature. For example, we set a log-uniform prior for the ridge regression penalty, and a uniform prior for the maximum depth of a tree in a random forest. Auto-WEKA works with continuous hyperparameter values up to the precision of the machine it is run on; nevertheless, to give a sense of the size of the space we studied, we note that discretizing the hyperparameter domains to a maximum of 10 values each gives rise to over 10^47 hyperparameter settings. We emphasize that this space is much larger than a simple union of the base learners' hyperparameter spaces (whose size is roughly 10^8), since the ensemble methods allow up to 5 independent base learners, giving rise to a space with roughly (10^8)^5 = 10^40 elements. Feature selection gives rise to another independent decision between roughly 10^6 choices, and several parameters on the ensemble and meta level contribute another order of magnitude to the total size of Auto-WEKA's hyperparameter space. Auto-WEKA can be thought of as a single learning algorithm with a highly conditional hyperparameter space.

Table 3.2: Feature search/evaluator methods in Auto-WEKA. * indicates search methods requiring one feature evaluator that is used to determine the importance of a feature.

Feature Method (Categorical, Numeric)
Best First (1, 1)
Greedy Stepwise (3, 2)
Ranker (0, 1)
CFS Subset Eval (2, 0)
Pearson Correlation Eval (0, 0)
Gain Ratio Eval (0, 0)
Info Gain Eval
1R Eval (1, 2)
Principal Components Eval (2, 2)
RELIEF Eval (1, 2)
Symmetrical Uncertainty Eval (1, 0)

As depicted in Figure 3.1, Auto-WEKA has two top-level Boolean parameters. The first, is_base, selects between single base learning algorithms and ensemble or meta-algorithms. If is_base is true, then the parameter base determines which of the 27 base methods is to be used. If is_base is false, then learner indicates either an ensemble or a meta-algorithm. If learner is a meta-algorithm, then the parameter meta_base selects one of the 27 base algorithms. If learner is an ensemble algorithm, an additional parameter num_learners, an integer chosen from {1, ..., 5}, determines the number of base algorithms to be used; base_i variables are then selected according to the value of num_learners, each determining which of the 27 base algorithms to use. For each base parameter, the hyperparameters of all the base algorithms are attached and made conditional upon that parameter selecting the corresponding base algorithm.

Auto-WEKA's second top-level Boolean parameter, feat_sel, determines whether to apply one of the feature selection methods. If feat_sel is false, then Auto-WEKA passes the unmodified dataset to the learning algorithm. If it is true, then feat_ser selects the feature search method, and feat_eval selects the feature evaluator (each with conditional hyperparameters attached). This results in a very wide tree that captures the hierarchical nature of the hyperparameters and allows the creation of a single hyperparameter optimization problem with four hierarchical layers, consisting of a total of 786 parameters for classification problems and 472 parameters for regression problems; the difference arises because far fewer base algorithms in WEKA are able to make numeric predictions than can make categorical predictions.
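A compact sketch of the top-level structural choices just described (illustrative only: the lists of algorithms and feature methods are stubs, and the per-algorithm hyperparameter sub-spaces that Auto-WEKA attaches to each choice are omitted):

```python
import random

BASE_ALGORITHMS = ["NaiveBayes", "J48", "SMO"]          # stand-ins for the 27 base methods
META_ALGORITHMS = ["AdaBoostM1", "Bagging"]             # stand-ins for the 10 meta-methods
ENSEMBLE_ALGORITHMS = ["Voting", "Stacking"]            # the 2 ensemble methods
FEAT_SEARCH = ["BestFirst", "GreedyStepwise", "Ranker"]
FEAT_EVAL = ["CfsSubsetEval", "InfoGainEval", "ReliefEval"]

def sample_auto_weka_structure(rng):
    """Sample only the top-level (structural) choices of Auto-WEKA's space."""
    config = {"is_base": rng.random() < 0.5}
    if config["is_base"]:
        config["base"] = rng.choice(BASE_ALGORITHMS)
    else:
        config["learner"] = rng.choice(META_ALGORITHMS + ENSEMBLE_ALGORITHMS)
        if config["learner"] in META_ALGORITHMS:
            config["meta_base"] = rng.choice(BASE_ALGORITHMS)
        else:
            config["num_learners"] = rng.randint(1, 5)
            config["bases"] = [rng.choice(BASE_ALGORITHMS)
                               for _ in range(config["num_learners"])]
    config["feat_sel"] = rng.random() < 0.5
    if config["feat_sel"]:
        config["feat_ser"] = rng.choice(FEAT_SEARCH)     # feature search method
        config["feat_eval"] = rng.choice(FEAT_EVAL)      # feature evaluator
    return config
```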

Figure 3.1: Auto-WEKA's top-level parameters. Top: is_base controls Auto-WEKA's choice of either using a base algorithm or using a meta- or ensemble learner; the triangular items represent a parameter that selects one of the 27 base algorithms and its associated hyperparameters. Bottom: feat_sel controls Auto-WEKA's choice of feature selection methods.

Since Auto-WEKA is agnostic about the choice of optimizer, we implemented variants leveraging SMAC, TPE, and I/F-Race. SMAC, TPE and I/F-Race have their own parameters influencing performance, such as TPE's choice of the γ-quantile separating good from bad performance, the number of trees inside SMAC's random forest model, or I/F-Race's number of newly sampled candidates at each iteration. In Auto-WEKA, we used the defaults for these meta-hyperparameters, as set by their respective authors. Further improvements may be obtainable by optimizing these meta-hyperparameters, but a separate process with a meta-level training/validation split would be required to guard against over-fitting, and we did not attempt this due to the extreme computational cost of such experiments. All three model-based optimizers are randomized algorithms and thus produce different results depending on the random seed provided. As demonstrated in work by Hutter et al. [2012], this allows for trivial, yet effective parallelization of the optimization

process via simply performing k independent runs of the optimization method in parallel and selecting the result of the run with the lowest cross-validation error. Other, more sophisticated methods for the parallelization of Bayesian optimization exist [Hutter et al., 2012, Bergstra et al., 2011, Desautels et al., 2012, Snoek et al., 2012], but to date there is no empirical evidence that these methods outperform the simple approach we used here when the cost of evaluating hyperparameter configurations varies across the hyperparameter space. Our SMAC and TPE variants of Auto-WEKA use this simple parallelization approach, simulating runs on a standard quad-core desktop using 4 parallel jobs. The authors of I/F-Race, however, specifically designed their algorithm to run in parallel during the racing phase. As such, our I/F-Race variant of Auto-WEKA performs evaluations of candidates in parallel across 4 CPU cores.

Auto-WEKA also supports various resource constraints. When evaluating the performance of a learning algorithm on a pair of training and validation datasets, Auto-WEKA enforces both memory and time limits. If the learning algorithm requests more than a user-defined threshold of RAM, Auto-WEKA aborts its training (and treats the evaluation as a failure in the optimization method). Auto-WEKA also limits the time that can be used for training a learning algorithm on each pair of training and validation datasets, to ensure that the optimization technique has a chance to sufficiently explore the search space. The user sets a training budget in advance; once the budget has been consumed, Auto-WEKA sends an interrupt to the learning algorithm, asking it to finish training as soon as possible. In this case, the learning algorithm produces a (partially) trained model, which is then used to generate an error estimate on the validation data. Snoek et al. [2011] presented a promising approach for using runtime predictions in the expected improvement calculation to automatically drive the search away from excessively expensive models. While we did not implement such a technique, we see it as an interesting avenue to be explored in future work.
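The simple parallelization scheme just described can be pictured with a short sketch: k independent optimizer runs, one per random seed, executed in parallel, after which the run with the lowest cross-validation error is selected. This is an illustration under our own assumptions; names such as ParallelRunsSketch and optimizerRun are invented for the example and are not Auto-WEKA's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch (not Auto-WEKA's code) of the simple parallelization
// scheme: run k independent optimizer runs in parallel and keep the best one.
public class ParallelRunsSketch {

    /** Stand-in for one complete optimizer run; returns its best CV error. */
    static double optimizerRun(long seed) {
        // A real run would search the hyperparameter space within the user's
        // time and memory budgets; aborted evaluations inside the run are
        // reported to the optimizer as failures. Here we just fake an error.
        return new Random(seed).nextDouble();
    }

    public static void main(String[] args) throws InterruptedException {
        int k = 4;  // e.g. one job per core of a standard quad-core desktop
        ExecutorService pool = Executors.newFixedThreadPool(k);

        List<Future<Double>> runs = new ArrayList<>();
        for (long seed = 0; seed < k; seed++) {
            final long s = seed;
            Callable<Double> run = () -> optimizerRun(s);
            runs.add(pool.submit(run));
        }

        int bestSeed = -1;
        double bestError = Double.POSITIVE_INFINITY;
        for (int i = 0; i < runs.size(); i++) {
            try {
                double err = runs.get(i).get();
                if (err < bestError) {
                    bestError = err;
                    bestSeed = i;
                }
            } catch (ExecutionException e) {
                // A crashed run is simply ignored when picking the winner.
            }
        }
        pool.shutdown();
        System.out.println("selected seed " + bestSeed + ", CV error " + bestError);
    }
}
```

The per-evaluation limits described above (the RAM threshold and the training-time interrupt) would sit inside optimizerRun, around each individual model training, rather than around the whole run.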

In addition to supporting large-scale experiments on many datasets simultaneously, Auto-WEKA provides a user-friendly graphical interface. The interface operates in two modes, the first acting as a wizard (Figure 3.2). In wizard mode, a user specifies their dataset and the amount of computation time available. Auto-WEKA's experiment builder mode (Figure 3.3) presents additional parameter choices. The first screen accepts training and test data, and additionally specifies the method that Auto-WEKA will use to generate pairs of training and validation datasets. On the second screen, the user customizes the learning algorithms to be included in the search, possibly excluding algorithms that may be problematic for the dataset. The final screen sets the optimizer to use and specifies the user's resource constraints. Both modes then provide a way to perform and monitor the optimization process for different random seeds. After the optimization is complete, Auto-WEKA provides a summary of the performance of the selected algorithm and its hyperparameters, and allows the user to make predictions on new data (Figure 3.4). Like WEKA, we implemented Auto-WEKA in Java, and the software works on both UNIX-based and Windows machines. Auto-WEKA and its source code are available online, and we are committed to ensuring that Auto-WEKA remains available to new users.

Figure 3.2: Auto-WEKA's wizard interface.

Figure 3.3: Auto-WEKA's experiment builder workflow.

Figure 3.4: Auto-WEKA's interface for examining the best learning algorithm and hyperparameters after an experiment has been run.
