Ranking and selection problem with 1,016,127 systems, requiring 95% probability of correct selection, solved in less than 40 minutes on 600 parallel cores, with near-linear scaling and high utilization of the computing budget.

[Figure: Wallclock time (sec, log scale) vs. number of cores (60 to 960): actual performance vs. perfect scaling.]
Ranking and Selection in a High Performance Computing Environment
Eric Cao Ni, Susan R. Hunter, Shane G. Henderson
School of Operations Research and Information Engineering, Cornell University; School of Industrial Engineering, Purdue University
Supported by NSF grant CMMI-1200315 and the Extreme Science and Engineering Discovery Environment (XSEDE), NSF grant OCI-1053575.
Winter Simulation Conference, Washington, DC, December 9, 2013
1 Introduction
2 Considerations for parallel procedures
3 The Algorithm
4 Summary
Parallelism in simulation
Two main research areas relate to exploiting parallelism in discrete-event simulation:
Many processors on a single replication (Fujimoto, 2000)
Many processors on independent replications (Heidelberger, 1988; Glynn and Heidelberger, 1990, 1991)
Ranking and Selection (R&S)

max_{i ∈ S} y(i) = E[Y(i; ξ)]

Optimize a function through a stochastic simulation.
The function is evaluated with error.
The feasible region is finite: K = |S| < ∞. No assumption on the topology of S.
Want to find the best system i ∈ S with a certain degree of statistical confidence:

P[select system j : y(j) ≥ y(i) − δ for all i ∈ S] ≥ 1 − α
Parallelism in simulation optimization
Many existing simulation-optimization (in particular, R&S) algorithms are sequential in nature (Paulson, 1964; Fabian, 1974; Kim and Nelson, 2001, 2006; Hong, 2006).
Past studies on parallel ranking and selection procedures include Chen (2005) and Luo and Hong (2011). Luo et al. (2013) established an asymptotically valid parallel ranking-and-selection procedure.
We propose a parallel algorithm for R&S that is
Valid (maintains a required probability of correct selection)
Efficient (speeds up as more cores are employed)
Assumptions on simulation output
Y_ijk: the output of the kth replication of simulating system i on core j, where 1 ≤ i ≤ |S|, 1 ≤ j ≤ W, k = 1, 2, ...
T_ijk: the (random) completion time.
The cores produce i.i.d. replicates of (Y_i, T_i) for each i ∈ S.
Y_i is marginally normally distributed with finite mean µ_i and finite (and possibly unknown) variance σ²_i.
E[T_i] < ∞ for all i ∈ S.
{Y_i : 1 ≤ i ≤ |S|} can be correlated: it is possible to use Common Random Numbers (CRN).
The computing environment
(1) A pre-specified, fixed number of cores are always available and do not fail or suddenly become unavailable;
(2) The cores are identical and capable of message-passing;
(3) Communication between cores is nearly instantaneous;
(4) Messages join a queue for processing by the receiving core and are never lost.
We implemented our algorithm in C/C++ using the Message Passing Interface (MPI), tested on the Extreme Science and Engineering Discovery Environment (XSEDE)'s Lonestar cluster.
A simple master-worker framework
One core is designated the master and the others are workers.
The master core monitors the progress of the algorithm and distributes work to the workers.
Worker cores produce simulation replications according to the master's instructions.
Each worker exchanges information only with the master.
(Image source: Hadoop Illuminated, M. Kerzner and S. Maniyam)
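The pattern above can be sketched in miniature. This is an illustrative Python sketch using threads and queues in place of MPI message passing (the actual implementation is C/C++ with MPI); the job payloads and the stand-in "simulation" are hypothetical.

```python
import queue
import threading

def master_worker_demo(jobs, n_workers=4):
    """Minimal master-worker sketch: the master posts jobs to a task
    queue, workers run them and send results back, and workers talk
    only to the master (here, the main thread)."""
    tasks, results = queue.Queue(), queue.Queue()

    def worker():
        while True:
            job = tasks.get()
            if job is None:                  # master's shutdown message
                break
            system, reps = job
            results.put((system, reps * 2))  # stand-in for simulating

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for job in jobs:                         # master distributes work
        tasks.put(job)
    out = [results.get() for _ in jobs]      # master collects results
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return dict(out)

print(master_worker_demo([(i, 10) for i in range(3)]))  # {0: 20, 1: 20, 2: 20}
```

Real MPI code would replace the queues with point-to-point sends and receives, but the control flow is the same.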
2 Considerations for parallel procedures
Naive parallelism does not work
Example: 1 system, 2 workers.
Worker j produces i.i.d. replications ((Y_j1, T_j1), (Y_j2, T_j2), ...).
Y_jk is (marginally) Normal(0, 1).
T_jk = 1 if Y_jk < 0, and T_jk = 2 if Y_jk ≥ 0.
Let (Y_1, T_1) be the outcome of the first replication completed. The four equally likely sign patterns satisfy

P(Y_11 < 0, Y_21 < 0) = P(Y_11 ≥ 0, Y_21 < 0) = P(Y_11 < 0, Y_21 ≥ 0) = P(Y_11 ≥ 0, Y_21 ≥ 0) = 1/4,

and the first three all yield Y_1 < 0 with T_1 = 1, while only the last yields Y_1 ≥ 0 with T_1 = 2. Hence P(Y_1 < 0) = 3/4: Y_1 is NOT normal, and E[Y_1] ≠ 0!
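A minimal Monte Carlo sketch of this bias (illustrative Python, not the paper's code): taking whichever replication finishes first skews the estimator toward the fast, negative outcomes.

```python
import random

def first_completed(rng):
    # One system, two workers: each draws Y ~ Normal(0, 1); the
    # replication takes 1 time unit if Y < 0 and 2 time units if
    # Y >= 0.  Return the output of whichever replication finishes
    # first (ties broken by worker index).
    y = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    t = [1 if v < 0 else 2 for v in y]
    return y[0] if t[0] <= t[1] else y[1]

rng = random.Random(42)
n = 200_000
samples = [first_completed(rng) for _ in range(n)]
frac_negative = sum(v < 0 for v in samples) / n
mean = sum(samples) / n
print(round(frac_negative, 2), round(mean, 2))  # about 0.75 and -0.40
```

The estimator is negative three quarters of the time, matching the 1/4 calculation above, even though the true mean is 0.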
Naive parallelism does not work
In general, the set of replications that have completed by a fixed time may not be i.i.d. with the correct distribution (Heidelberger, 1988; Glynn and Heidelberger, 1990, 1991).
Solution: use estimators based on a fixed number of replications, taken in a pre-determined order, completed in a random amount of time.
Naive parallelism does not work

[Figure 1: An illustration of the simulation results collected on the master: replications 1, 2, 3 completing at times T_1, T_2, T_3.]

Notice that at T_2, a valid estimator uses only the output from replication 1.
Screening can be expensive
In sequential R&S procedures, screening is often pairwise and periodic: each system is compared with all other surviving systems after one (or a few) additional replications.
This may prove problematic in a parallel setting: with multiple workers, replications are generated much faster, and screening may become too much work for any single core.
Don't screen all pairs

[Figure 2: Screening on the master: all pairwise screens among the systems, performed by the master core.]

Solution 1: Distribute screening among workers.
Solution 2: Perform a subset of pairwise screens.
Don't screen all pairs

[Figure 3: Screening on the master: within-core screening, plus screening against the master core's best.]
[Figure 4: Screening on the workers (cores 1-5): within-core screening, plus between-core screening against each core's best.]

Solution 1: Distribute screening to workers.
Solution 2: Perform a subset of pairwise screens.
Don't screen all pairs
Solution 2: Perform a subset of pairwise screens.

Proposition 1. The statistical guarantee is preserved if some pairs are dropped from screening, provided the guarantee is based on the Bonferroni argument

P(ICS) ≤ Σ_{i=1}^{K−1} P(A_{iK}),

where A_{iK} is the event that inferior system i incorrectly eliminates the best system K.

Thus, each worker may screen only among the systems assigned to it, plus against the other workers' best systems.
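The reduced screening set can be enumerated explicitly. This is an illustrative sketch (the names `assignment` and `best_of` are not from the paper; in practice each worker's current best changes as sampling proceeds):

```python
from itertools import combinations

def screening_pairs(assignment, best_of):
    """Pairs screened under the reduced scheme: all pairs within each
    worker's partition, plus each system against the other workers'
    current sample-best systems."""
    pairs = set()
    for w, systems in assignment.items():
        pairs.update(combinations(sorted(systems), 2))   # within-worker
        for i in systems:
            for w2, b in best_of.items():
                if w2 != w and b != i:
                    pairs.add(tuple(sorted((i, b))))     # vs other bests
    return pairs

# 100 systems split across 5 workers of 20 each
assignment = {w: list(range(20 * w, 20 * (w + 1))) for w in range(5)}
best_of = {w: assignment[w][0] for w in range(5)}  # placeholder bests
reduced = screening_pairs(assignment, best_of)
full = 100 * 99 // 2
print(len(reduced), "of", full)  # far fewer than all 4950 pairs
```

Here only 1,340 of the 4,950 possible pairs are screened, and the fraction shrinks further as the number of systems per worker grows.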
Screening can be expensive
Solution 3: Do not screen on every replication.

Proposition 2. Screening on a pre-determined subsequence of replications does not decrease the probability of correct selection.

The proof follows directly from Jennison et al. (1980, 1982). The subsequences must be pre-determined to avoid the bias induced by random completion times: between each pair of systems (i_1, i_2), screen at replication counts (b·n_1, b·n_2), for b = 1, 2, ... and possibly unequal n_1, n_2. The step count b need not be equal across all pairs: screening can be made asynchronous.
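The per-pair schedule can be sketched as follows. This is an illustrative data structure, not the paper's implementation; the class and method names are hypothetical.

```python
class PairScreens:
    """Tracks, per system pair, the pre-determined replication counts
    at which that pair is screened: at step b, system i needs b * n_i
    completed replications.  Each pair advances its own step b, so
    screening is asynchronous across pairs."""

    def __init__(self, batch):
        self.batch = batch   # system -> batch size n (may differ per system)
        self.step = {}       # pair -> current step b

    def due(self, pair, completed):
        """True if both systems have reached the counts for this pair's
        current step b (completed maps system -> replications done)."""
        i, j = pair
        b = self.step.get(pair, 1)
        return (completed[i] >= b * self.batch[i]
                and completed[j] >= b * self.batch[j])

    def advance(self, pair):
        self.step[pair] = self.step.get(pair, 1) + 1

ps = PairScreens({1: 10, 2: 4})
done = {1: 10, 2: 7}
print(ps.due((1, 2), done))   # True: 10 >= 1*10 and 7 >= 1*4
ps.advance((1, 2))
print(ps.due((1, 2), done))   # False: step 2 needs 20 and 8 completions
```

Because the counts depend only on b and the fixed batch sizes, never on which replications happen to finish first, the bias from random completion times is avoided.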
Handling random number generation
Random numbers generated across the workers should be independent; the workers generate identical random numbers if no specific instruction is given!
Solution: Use the RngStream generator with streams and substreams, proposed in L'Ecuyer et al. (2002), in the master-worker framework:
At initialization, the master generates one stream Z_j for each worker j.
When worker j simulates system i, it uses a fixed number of random numbers from the ith substream, Z_ji.
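The stream/substream layout can be mimicked in a few lines. RngStream obtains provably non-overlapping substreams by jumping ahead in the underlying generator; the sketch below merely derives a distinct seed per (stream, substream) pair, which is an analogue for illustration, not the real construction.

```python
import hashlib
import random

def substream_rng(worker_id, system_id, base_seed=2002):
    """Stand-in for RngStream's layout: worker j owns stream Z_j and
    uses substream Z_{ji} when simulating system i.  Here we hash the
    (stream, substream) pair into a seed; RngStream instead jumps
    ahead within one long-period generator."""
    key = f"{base_seed}:{worker_id}:{system_id}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(seed)

# Reproducible per (worker, system); distinct across workers and systems.
a = substream_rng(1, 7).random()
b = substream_rng(1, 7).random()
c = substream_rng(2, 7).random()
print(a == b, a == c)  # True False (same substream repeats; others differ)
```

Fixing the substream to the system, rather than to whichever worker happens to be free, also makes runs reproducible and supports common random numbers.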
3 The Algorithm
A three-stage parallel R&S algorithm
Stage 0: Simulate all systems to estimate simulation completion times.
Stage 1: (If variances need to be estimated) Independently of Stage 0, simulate systems on the workers to obtain variance estimates.
Stage 2: Simulate and screen the remaining systems until one system remains.
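A toy sequential emulation of this three-stage flow, for intuition only: the elimination rule below is a naive mean-gap heuristic, NOT the statistically valid screen of the actual procedure, and all names and parameters are hypothetical.

```python
import random
import statistics

def toy_three_stage(means, reps_per_batch=50, delta=0.5, seed=1):
    """Toy emulation: Stage 0 timing pilots, Stage 1 variance
    estimates, Stage 2 batched simulate-and-screen until one
    system survives."""
    rng = random.Random(seed)

    def sim(i):
        return rng.gauss(means[i], 1.0)

    # Stage 0: pilot runs; in the real algorithm these estimate
    # completion times so the master can balance the load
    _pilots = {i: [sim(i) for _ in range(10)] for i in sorted(means)}

    # Stage 1: variance estimates from independent replications
    _variances = {i: statistics.variance([sim(i) for _ in range(20)])
                  for i in sorted(means)}

    # Stage 2: simulate surviving systems in batches and screen
    surviving = set(means)
    samples = {i: [] for i in means}
    while len(surviving) > 1:
        for i in sorted(surviving):
            samples[i].extend(sim(i) for _ in range(reps_per_batch))
        xbar = {i: statistics.fmean(samples[i]) for i in surviving}
        best = max(xbar.values())
        surviving = {i for i in surviving if xbar[i] > best - delta}
    return surviving.pop()

means = {"A": 0.0, "B": 2.0, "C": 4.0, "D": 1.0}
print(toy_three_stage(means))  # selects "C" (the true best) w.h.p.
```

In the parallel algorithm the batches of Stage 2 are farmed out to workers, and screening is distributed as described earlier.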
Don't screen all pairs (recap)

[Figure 5: Screening on the master.] [Figure 6: Screening on the workers.]
A three-stage parallel R&S algorithm
Strategy 1: Dedicate both the simulation and the screening of each system to a worker.

[Figure 7: Utilization of workers using Strategy 1.]
A three-stage parallel R&S algorithm
Strategy 2: Only the screening is partitioned and dedicated to a worker.

[Figure 8: Utilization of workers using Strategy 2.]
Numerical example
We apply our parallel algorithm to a throughput-maximization problem (SimOpt.org), of which Luo et al. (2013) solved a version with 3,249 systems. We solve the problem with 1,016,127 systems under consideration.
Performance on 1,016,127 systems

[Figure 9: Wallclock time (sec, log scale) vs. number of cores (60 to 960): actual performance closely tracks perfect scaling.]
4 Summary
Summary
We proposed an R&S procedure for a high-performance parallel computing environment that is capable of solving large-scale R&S problems.
The statistical guarantee is maintained through:
Screening on subsequences
Carefully managed random number generation
Parallelizing both simulation and screening leads to decent speed-up.
What's next?
Consider other computing architectures:
Eliminate the master (a potential bottleneck)
Cloud platforms, where cores are less reliable
High switching costs
Compare with parallel versions of two-stage procedures.
Test on different problems.
Thank you! Questions?
References
E. Jack Chen. Using parallel and distributed computing to increase the capability of selection procedures. In Proceedings of the 2005 Winter Simulation Conference, pages 723-731, 2005. ISBN 0-7803-9519-0. URL http://dl.acm.org/citation.cfm?id=1162708.1162832.
V. Fabian. Note on Anderson's sequential procedures with triangular boundary. Annals of Statistics, 2(1):170-176, 1974.
R. M. Fujimoto. Parallel and Distributed Simulation Systems. Wiley, New York, 2000.
P. W. Glynn and P. Heidelberger. Bias properties of budget constrained simulations. Operations Research, 38:801-814, 1990.
P. W. Glynn and P. Heidelberger. Analysis of parallel replicated simulations under a completion time constraint. ACM Transactions on Modeling and Computer Simulation, 1(1):3-23, 1991.
P. Heidelberger. Discrete event simulations and parallel processing: statistical properties. SIAM Journal on Scientific and Statistical Computing, 9(6):1114-1132, 1988.
L. Jeff Hong. Fully sequential indifference-zone selection procedures with variance-dependent sampling. Naval Research Logistics, 53(5):464-476, 2006.
C. Jennison, I. M. Johnstone, and B. W. Turnbull. Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means. Technical Report 463, School of Operations Research and Industrial Engineering, Cornell University, Ithaca, NY, 1980.
C. Jennison, I. M. Johnstone, and B. W. Turnbull. Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means. In S. S. Gupta and J. O. Berger, editors, Statistical Decision Theory and Related Topics III, vol. 2, pages 55-86. Academic Press, New York, 1982.
S.-H. Kim and B. L. Nelson. A fully sequential procedure for indifference-zone selection in simulation. ACM Transactions on Modeling and Computer Simulation, 11(3):251-273, 2001.
S.-H. Kim and B. L. Nelson. Selecting the best system. In S. G. Henderson and B. L. Nelson, editors, Simulation, Handbooks in Operations Research and Management Science, pages 501-534. Elsevier, Amsterdam, 2006.
P. L'Ecuyer, R. Simard, E. J. Chen, and W. D. Kelton. An object-oriented random-number package with many long streams and substreams. Operations Research, 50(6):1073-1075, 2002.
J. Luo and L. J. Hong. Large-scale ranking and selection using cloud computing. In Proceedings of the 2011 Winter Simulation Conference, pages 4051-4061, 2011. URL http://dl.acm.org/citation.cfm?id=2431518.2432002.
J. Luo, L. J. Hong, B. L. Nelson, and Y. Wu. Fully sequential procedures for large-scale ranking-and-selection problems in parallel computing environments. Submitted, 2013.
E. Paulson. A sequential procedure for selecting the population with the largest mean from k normal populations. Annals of Mathematical Statistics, 35(1):174-180, 1964.