Parallel Performance of the IMSL C Numerical Library

Parallel Performance of the IMSL C Numerical Library Benchmarking OpenMP A White Paper by Rogue Wave Software October 2012 Update Rogue Wave Software 5500 Flatiron Parkway, Suite 200 Boulder, CO 80301, USA www.rougewave.com

Parallel Performance of the IMSL C Numerical Library Benchmarking OpenMP by Rogue Wave Software 2012 by Rogue Wave Software. All Rights Reserved Printed in the United States of America Publishing History: October 2012 - Update November 2008 First Publication Trademark Information The Rogue Wave Software name, logo, IMSL and PV-WAVE are registered trademarks of Rogue Wave Software, Inc. or its subsidiaries in the US and other countries. JMSL, JWAVE, TS-WAVE, and PyIMSL are trademarks of Rogue Wave Software, Inc. or its subsidiaries. All other company, product or brand names are the property of their respective owners. IMPORTANT NOTICE: The information contained in this document is subject to change without notice. Rogue Wave Software, Inc. makes no warranty of any kind with regards to this material, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Rogue Wave Software, Inc. shall not be liable for errors contained herein or for incidental, consequential, or other indirect damages in connection with the furnishing, performance, or use of this material.

TABLE OF CONTENTS Abstract...4 Parallel Performance of the IMSL C Numerical Library...5 Thread-Safe User Functions...5 Amdahl s Law...5 Measurements...6 Windows with Visual Studio 2010 (32-bit)...6 Windows with Visual Studio 2010 (64-bit)...11 Red Hat 6 with gcc 4.4.4 (32-bit)...15 Red Hat 6 with gcc 4.4.4 (64-bit)...19 Sun Solaris Intel (64-bit)...22 Test Problems...25 Quadrature Functions...25 Optimization Neural Network Problem...25 Full Set of Benchmark Problems...26 Sample Benchmarking Code...29 Conclusion...32 About the IMSL Libraries...33 About the Author...33 About Rogue Wave Software...33

Abstract The use of OpenMP directives within the IMSL C Numerical Library is an effective means of achieving scalable parallelism in a portable fashion across a wide variety of computing platforms. While other technologies exist for SMP parallelization, OpenMP is a stable, wellknown and mature choice with broad platform support, which is a key for the IMSL C Numerical Library. This paper discusses a wide range of algorithms parallelized in IMSL C Numerical Library, but this is not an exhaustive list; please refer to the product documentation for a complete list. 4 www.roguewave.com

Parallel Performance of the IMSL C Numerical Library The IMSL C Numerical Library uses OpenMP to enable parallelism on shared memory systems, especially multi-core systems. Starting with Version 7.0 of the IMSL C Library, released in November 2008, OpenMP directives were added to a variety of functions in the library. Additional algorithms have been updated to use OpenMP directives since the release of Version 7.0. The goal is to take advantage of multi-core systems while minimizing impact on existing user code that references the IMSL C Library and also minimizing the engineering resources. Under these constraints, existing algorithms in the library were not re-written, but instead parallelized by added OpenMP directives wherever sensible. Because high performance vendor libraries are linked by the IMSL C Library for best performance for many linear algebra functions, the focus was on parallelizing other sections of the library. The performance of selected IMSL C Numerical Library functions was measured on multi-core systems. Each function was used to solve a large enough problem to allow for parallelism. Each test case was run with a varying number of threads allowed. The number of threads was set using the OpenMP function omp_set_num_threads. The specific problems run are described at the end of this paper. Thread-Safe User Functions Many of the IMSL C Library functions which use OpenMP can evaluate user-defined functions in parallel. This requires that the user s functions be thread-safe. For backward compatibility with existing user code, the IMSL C Library does not assume user functions are thread-safe and will not evaluate such functions in parallel. They are evaluated in parallel only if they are flagged as thread-safe by calling: for the IMSL C Math Library, imsl_omp_options(imsl_set_functions_thread_safe, 1, 0); and for the IMSL C Stat Library, imsls_omp_options(imsls_set_functions_thread_safe, 1, 0); These calls were made for all of the benchmarks described in this paper. Amdahl s Law Amdahl's Law relates the speedup using parallel processors versus using only one serial processor. The speedup, S, using N processors is 1 S N P (1 P) N 5 www.roguewave.com

where P is the fraction of the code which is parallel. Even if an unlimited number of processors were available, the maximum speedup is limited to 1 S (1 P) For example, if 80% of the code is parallelized then the maximum speedup is 5. The IMSL C Library nonlinear regression function, imsls_d_nonlinear_regression, can be used to estimate the fraction of the code running in parallel, P, from the measured speedups at various values of N. This is only an approximation because the fraction of the code which is parallelized can be a function of N, but it allows for easy evaluation of the performance gained for each function. Measurements Each benchmark was run five times, and the reported results are the average elapsed wall clock time over all the five runs. The one exception is imsls_d_naive_bayes_trainer, which was run 250 times because the time for each run was small. Windows with Visual Studio 2010 (32-bit) Hardware: Dual Quad Core Xeon E5420 (Harpertown) (8 cores in total) 2.5GHz, 133MHz Front Side Bus Operating System: Windows 7 Compiler: Microsoft Visual Studio 2010 (32-bit) 6 www.roguewave.com

omp_set_num_threads 1 2 4 6 8 Fraction IMSL Function Time (seconds) Parallel imsl_d_int_fcn_alg_log 45.62 23.83 12.83 9.16 7.35 0.96 imsl_d_int_fcn_cauchy 97.54 52.25 28.42 21.08 16.63 0.95 imsl_d_int_fcn_fourier 9.01 4 3.12 2.91 3.77 0.77 imsl_d_int_fcn_hyper_rect 1.04 0.57 0.28 0.19 0.27 0.91 imsl_d_int_fcn_inf 92.13 51.5 28.15 20.94 15.07 0.95 imsl_d_int_fcn_sing 24.94 13.08 7.13 4.76 3.57 0.98 imsl_d_int_fcn_sing_pts 99.77 52.28 28.53 19.03 14.29 0.98 imsl_d_int_fcn_smooth 24.96 13.08 7.14 5.95 4.76 0.92 imsl_d_int_fcn_trig 57.02 32.08 17.84 14.27 10.71 0.92 imsl_d_fast_poisson_2d 18.13 15.76 14.99 14.14 13.94 0.26 imsl_d_constrained_nlp 73.26 37.18 19.09 13.02 10.03 0.99 imsl_d_min_con_gen_lin 532.08 267.79 135.09 90.48 68.69 1 imsl_d_min_uncon_multivar 48.42 24.43 12.36 8.31 6.33 0.99 imsl_d_nonlin_least_squares 28.31 14.31 7.29 4.91 3.78 0.99 imsl_d_covariances 4.98 2.56 1.45 1.05 1.22 0.9 imsl_d_random_poisson 2.7 1.84 1.45 1.33 1.39 0.59 imsls_d_nonlinear_optimization 300.27 152.44 78.14 53.96 73.76 0.92 imsls_d_nonlinear_regression 46.83 23.6 11.98 8.11 6.19 0.99 imsls_d_covariances 52.54 26.63 13.53 9.31 7.11 0.99 imsls_d_dissimilarities 20.7 15.65 10.64 7.63 6.76 0.75 imsls_d_factor_analysis 61.96 54.21 44.16 39.39 37.17 0.43 imsls_d_cluster_hierarchical 10.64 7.54 5.37 4.62 4.29 0.68 imsls_d_multivariate_normal_cdf 9.25 4.63 2.32 2.32 1.17 0.98 imsls_d_random_beta 9.41 5.16 3.03 2.32 1.97 0.9 imsls_d_random_exponential_mix 5.93 3.42 2.17 1.75 1.53 0.85 imsls_d_random_gamma 5.13 3.01 1.95 1.61 1.43 0.82 imsls_d_random_general_continuous 8.64 4.77 2.84 2.2 1.87 0.9 imsls_d_random_lognormal 15.55 13.7 12.79 12.48 12.32 0.24 imsls_d_random_normal 6.58 4.1 2.84 2.53 2.22 0.75 imsls_d_random_poisson 2.62 1.83 1.43 1.31 1.23 0.6 imsls_d_random_triangular 2.78 1.84 1.37 1.22 1.14 0.67 imsls_d_classification_trainer 143.27 91.63 64.55 56.35 51.74 0.73 imsls_d_genetic_algorithm 14.03 8.52 5.54 4.41 3.96 0.82 imsls_d_mlff_network_trainer 10.84 10.08 7.3 6.99 6.6 0.42 imsls_d_naive_bayes_trainer 0.03 0.01 0.01 0.01 0.01 0.86 7 www.roguewave.com

8 www.roguewave.com

9 www.roguewave.com

10 www.roguewave.com

Windows with Visual Studio 2010 (64-bit) Hardware: Dual Quad Core Xeon E5420 (Harpertown) (8 cores in total) 2.5GHz, 133MHz Front Side Bus Operating System: Windows 7 Compiler: Microsoft Visual Studio 2010 (64-bit) omp_set_num_threads 1 2 4 6 8 Fraction IMSL Function Time (seconds) Parallel imsl_d_int_fcn_alg_log 37.05 19.31 10.09 7.02 5.47 0.97 imsl_d_int_fcn_cauchy 80.17 43.15 23.15 17 13.17 0.95 imsl_d_int_fcn_fourier 6.93 4.99 3.91 3.65 3.35 0.58 imsl_d_int_fcn_hyper_rect 0.95 0.53 0.26 0.17 0.14 0.98 imsl_d_int_fcn_inf 63.99 36.75 21.19 15.05 11.15 0.93 imsl_d_int_fcn_sing 16.15 8.45 4.6 3.07 2.31 0.98 imsl_d_int_fcn_sing_pts 64.55 33.82 18.44 12.31 9.22 0.98 imsl_d_int_fcn_smooth 16.43 8.56 4.68 3.88 3.12 0.93 imsl_d_int_fcn_trig 36.89 20.75 11.53 9.22 6.93 0.92 imsl_d_fast_poisson_2d 13.03 12.28 11.83 11.72 11.58 0.12 imsl_d_constrained_nlp 58.09 29.64 15.3 10.53 8.19 0.98 imsl_d_min_con_gen_lin 310.37 156.05 78.72 52.85 40.3 0.99 imsl_d_min_uncon_multivar 46.04 23.21 11.75 7.89 6.01 0.99 imsl_d_nonlin_least_squares 23.17 11.75 5.98 4.04 3.1 0.99 imsl_d_covariances 5.38 2.78 1.53 1.08 0.86 0.96 imsl_d_random_poisson 2.15 1.51 1.28 1.22 1.22 0.52 imsls_d_nonlinear_optimization 255.07 131.47 69.97 49.69 40.16 0.96 imsls_d_nonlinear_regression 23.63 12 6.15 4.2 3.23 0.99 imsls_d_covariances 36.69 18.66 9.7 6.63 5.1 0.98 imsls_d_dissimilarities 17.61 13.7 9.56 7.43 5.82 0.72 imsls_d_factor_analysis 55.29 48.28 39.08 34.79 32.73 0.43 imsls_d_cluster_hierarchical 10.17 7.22 5.27 4.52 4.21 0.66 imsls_d_multivariate_normal_cdf 6.38 3.2 1.7 1.62 0.84 0.97 imsls_d_random_beta 8.88 4.88 2.87 2.22 1.89 0.9 imsls_d_random_exponential_mix 3.62 2.25 1.59 1.34 1.25 0.75 imsls_d_random_gamma 2.73 1.81 1.34 1.23 1.12 0.67 imsls_d_random_general_continuous 7.8 4.34 2.62 2.04 1.75 0.89 imsls_d_random_lognormal 11.97 10.97 10.5 10.33 10.25 0.16 imsls_d_random_normal 6.37 3.96 2.78 2.5 2.31 0.73 imsls_d_random_poisson 1.97 1.5 1.28 1.22 1.19 0.46 imsls_d_random_triangular 2.36 1.61 1.25 1.12 1.08 0.62 imsls_d_classification_trainer 109.31 72.57 52.99 46.78 44.54 0.68 imsls_d_genetic_algorithm 12.09 7.22 4.51 3.57 3.18 0.84 imsls_d_mlff_network_trainer 10.22 8.28 6.07 6.24 6.04 0.48 imsls_d_naive_bayes_trainer 0.02 0.01 0.01 0.01 0.01 0.87 11 www.roguewave.com

12 www.roguewave.com

13 www.roguewave.com

14 www.roguewave.com

Red Hat 6 with gcc 4.4.4 (32-bit) Hardware: Dual Quad Core Xeon E5420 (Harpertown) (8 cores in total) 2.5GHz, 133MHz Front Side Bus Operating System: Red Hat Enterprise Linux 6 Compiler: gcc 4.4.4 (32-bit) omp_set_num_threads 1 2 4 6 8 Fraction IMSL Function Time (seconds) Parallel imsl_d_int_fcn_alg_log 57.73 30.15 15.91 11.15 8.78 0.97 imsl_d_int_fcn_cauchy 125.33 67.48 36.35 26.78 20.8 0.95 imsl_d_int_fcn_fourier 9.07 4.99 3.9 4.72 3.36 0.69 imsl_d_int_fcn_hyper_rect 1.09 0.54 0.27 0.18 0.14 0.99 imsl_d_int_fcn_inf 92.13 51.5 28.15 20.94 15.07 0.95 imsl_d_int_fcn_sing 24.94 13.08 7.13 4.76 3.57 0.98 imsl_d_int_fcn_sing_pts 99.77 52.28 28.53 19.03 14.29 0.98 imsl_d_int_fcn_smooth 24.96 13.08 7.14 5.95 4.76 0.92 imsl_d_int_fcn_trig 57.02 32.08 17.84 14.27 10.71 0.92 imsl_d_fast_poisson_2d 18.13 15.76 14.99 14.14 13.94 0.26 imsl_d_constrained_nlp 73.26 37.18 19.09 13.02 10.03 0.99 imsl_d_min_con_gen_lin 532.08 267.79 135.09 90.48 68.69 1 imsl_d_min_uncon_multivar 48.42 24.43 12.36 8.31 6.33 0.99 imsl_d_nonlin_least_squares 24.64 12.62 6.58 4.56 3.56 0.98 imsl_d_covariances 4.97 2.58 1.38 1.08 0.88 0.94 imsl_d_random_poisson 4.61 3.45 2.88 2.69 2.70 0.49 imsls_d_nonlinear_optimization 217.91 119.38 68.89 54.78 61.55 0.86 imsls_d_nonlinear_regression 25.58 13.21 7.04 4.99 3.94 0.97 imsls_d_covariances 53.69 27.16 13.83 9.78 7.37 0.99 imsls_d_dissimilarities 21.07 16.23 9.38 7.33 7.17 0.76 imsls_d_factor_analysis 50.74 43.32 33.72 29.33 26.79 0.5 imsls_d_cluster_hierarchical 11.4 8.29 5.98 5.43 4.88 0.64 imsls_d_multivariate_normal_cdf 10.17 5.11 2.57 2.56 1.29 0.98 imsls_d_random_beta 14.89 8.1 4.7 3.57 3 0.91 imsls_d_random_exponential_mix 6.48 3.89 2.59 2.16 1.95 0.8 imsls_d_random_gamma 7.01 4.16 2.72 2.25 2.01 0.81 imsls_d_random_general_continuous 8.92 5.11 3.21 2.58 2.25 0.85 imsls_d_random_lognormal 20.15 16.17 14.32 13.55 13.18 0.39 imsls_d_random_normal 8.57 5.3 3.66 3.11 2.83 0.77 imsls_d_random_poisson 4.42 3.38 2.84 2.66 2.65 0.47 imsls_d_random_triangular 3.05 2.04 1.67 1.55 1.5 0.59 imsls_d_classification_trainer 155.35 97.43 69.3 59.63 54.82 0.74 imsls_d_genetic_algorithm 14.84 9.32 6.21 5.2 4.65 0.78 imsls_d_mlff_network_trainer 30.36 22.11 16.09 14.7 13.93 0.62 imsls_d_naive_bayes_trainer 0.03 0.02 0.01 0.01 0.01 0.89 15 www.roguewave.com

16 www.roguewave.com

17 www.roguewave.com

18 www.roguewave.com

Red Hat 6 with gcc 4.4.4 (64-bit) Hardware: Dual Quad Core Xeon E5420 (Harpertown) (8 cores in total) 2.5GHz, 133MHz Front Side Bus Operating System: Red Hat Enterprise Linux 6 Compiler: gcc 4.4.4 (64-bit) omp_set_num_threads 1 2 4 6 8 Fraction IMSL Function Time (seconds) Parallel imsl_d_int_fcn_alg_log 50.35 26.24 13.76 9.6 7.5 0.97 imsl_d_int_fcn_cauchy 108.64 58.31 31.5 23.16 17.92 0.95 imsl_d_int_fcn_fourier 10.15 7.29 5.66 5.26 4.84 0.59 imsl_d_int_fcn_hyper_rect 1.08 0.58 0.29 0.2 0.15 0.98 imsl_d_int_fcn_inf 89.01 47.67 24.6 18.17 12.43 0.98 imsl_d_int_fcn_sing 21.81 11.42 6.23 4.23 3.17 0.97 imsl_d_int_fcn_sing_pts 87.34 45.73 24.95 16.93 12.68 0.97 imsl_d_int_fcn_smooth 23.01 12.06 6.59 5.49 4.39 0.92 imsl_d_int_fcn_trig 49.83 28 15.72 12.6 9.45 0.92 imsl_d_fast_poisson_2d 9.23 7.99 7.33 7.13 7.02 0.27 imsl_d_constrained_nlp 70.7 35.85 18.39 12.51 9.65 0.99 imsl_d_min_con_gen_lin 490.46 246.67 124.46 83.38 63.31 1 imsl_d_min_uncon_multivar 46.84 23.59 11.94 8.02 6.11 0.99 imsl_d_nonlin_least_squares 23.71 12.1 6.28 4.34 3.36 0.98 imsl_d_covariances 4.47 2.29 1.2 0.94 0.75 0.95 imsl_d_random_poisson 2.49 2.01 1.8 1.74 1.73 0.36 imsls_d_nonlinear_optimization 388.32 200.36 104.21 73.51 57.6 0.97 imsls_d_nonlinear_regression 24.36 12.44 6.48 4.51 3.51 0.98 imsls_d_covariances 36.58 18.46 9.37 6.59 4.97 0.99 imsls_d_dissimilarities 15.52 12.95 8.17 6.06 5.36 0.72 imsls_d_factor_analysis 42.02 35.44 27.24 23.62 21.62 0.52 imsls_d_cluster_hierarchical 7.44 5.43 4.08 3.73 3.48 0.6 imsls_d_multivariate_normal_cdf 9.49 4.77 2.39 2.39 1.2 0.98 imsls_d_random_beta 13.43 7.53 4.57 3.59 3.09 0.88 imsls_d_random_exponential_mix 8.74 5.18 3.4 2.81 2.51 0.81 imsls_d_random_gamma 7.97 4.8 3.22 2.68 2.42 0.8 imsls_d_random_general_continuous 7.08 4.35 2.98 2.53 2.3 0.77 imsls_d_random_lognormal 11.83 10.08 9.22 8.92 8.78 0.3 imsls_d_random_normal 7.04 4.56 3.27 2.83 2.62 0.72 imsls_d_random_poisson 2.42 2.01 1.8 1.75 1.73 0.33 imsls_d_random_triangular 2.94 2.29 1.96 1.84 1.81 0.44 imsls_d_classification_trainer 118.24 72.28 48.63 42.69 37.03 0.78 imsls_d_genetic_algorithm 13.55 8.11 5.06 4.08 3.54 0.84 imsls_d_mlff_network_trainer 8.95 8.46 7.57 7.23 6.71 0.24 imsls_d_naive_bayes_trainer 0.02 0.01 0.01 0.01 0.01 0.89 19 www.roguewave.com

20 www.roguewave.com

21 www.roguewave.com

Sun Solaris Intel (64-bit) Hardware: Dual Quad Core Xeon E5420 (Harpertown) (8 cores in total) 2.5GHz, 133MHz Front Side Bus Operating System: Solaris 10 Compiler: Oracle Solaris Studio 12.2 (64-bit) omp_set_num_threads 1 2 4 6 8 Fraction IMSL Function Time (seconds) Parallel imsl_d_int_fcn_alg_log 50.35 26.24 13.76 9.6 7.5 0.97 imsl_d_int_fcn_cauchy 108.64 58.31 31.5 23.16 17.92 0.95 imsl_d_int_fcn_fourier 10.15 7.29 5.66 5.26 4.84 0.59 imsl_d_int_fcn_hyper_rect 1.08 0.58 0.29 0.2 0.15 0.98 imsl_d_int_fcn_inf 63.99 36.75 21.19 15.05 11.15 0.93 imsl_d_int_fcn_sing 16.15 8.45 4.6 3.07 2.31 0.98 imsl_d_int_fcn_sing_pts 64.55 33.82 18.44 12.31 9.22 0.98 imsl_d_int_fcn_smooth 16.43 8.56 4.68 3.88 3.12 0.93 imsl_d_int_fcn_trig 36.89 20.75 11.53 9.22 6.93 0.92 imsl_d_fast_poisson_2d 13.03 12.28 11.83 11.72 11.58 0.12 imsl_d_constrained_nlp 58.09 29.64 15.3 10.53 8.19 0.98 imsl_d_min_con_gen_lin 310.37 156.05 78.72 52.85 40.3 0.99 imsl_d_min_uncon_multivar 46.04 23.21 11.75 7.89 6.01 0.99 imsl_d_nonlin_least_squares 23.17 11.75 5.98 4.04 3.1 0.99 imsl_d_covariances 5.38 2.78 1.53 1.08 0.86 0.96 imsl_d_random_poisson 2.15 1.51 1.28 1.22 1.22 0.52 imsls_d_nonlinear_optimization 255.07 131.47 69.97 49.69 40.16 0.96 imsls_d_nonlinear_regression 23.63 12 6.15 4.2 3.23 0.99 imsls_d_covariances 36.69 18.66 9.7 6.63 5.1 0.98 imsls_d_dissimilarities 17.61 13.7 9.56 7.43 5.82 0.72 imsls_d_factor_analysis 55.29 48.28 39.08 34.79 32.73 0.43 imsls_d_cluster_hierarchical 10.17 7.22 5.27 4.52 4.21 0.66 imsls_d_multivariate_normal_cdf 6.38 3.2 1.7 1.62 0.84 0.97 imsls_d_random_beta 8.88 4.88 2.87 2.22 1.89 0.9 imsls_d_random_exponential_mix 3.62 2.25 1.59 1.34 1.25 0.75 imsls_d_random_gamma 2.73 1.81 1.34 1.23 1.12 0.67 imsls_d_random_general_continuous 7.8 4.34 2.62 2.04 1.75 0.89 imsls_d_random_lognormal 11.97 10.97 10.5 10.33 10.25 0.16 imsls_d_random_normal 6.37 3.96 2.78 2.5 2.31 0.73 imsls_d_random_poisson 1.97 1.5 1.28 1.22 1.19 0.46 imsls_d_random_triangular 2.36 1.61 1.25 1.12 1.08 0.62 imsls_d_classification_trainer 109.31 72.57 52.99 46.78 44.54 0.68 imsls_d_genetic_algorithm 12.09 7.22 4.51 3.57 3.18 0.84 imsls_d_mlff_network_trainer 10.22 8.28 6.07 6.24 6.04 0.48 imsls_d_naive_bayes_trainer 0.02 0.01 0.01 0.01 0.01 0.87 22 www.roguewave.com

23 www.roguewave.com

24 www.roguewave.com

Test Problems The benchmarks include some problems used to benchmark several different IMSL C Library functions. This section provides details on the various test problems in the benchmark suite. Quadrature Functions M 1N 1 x 1 ( x) ( i j /100) i3 j3 f f M 1N 1 x 2 ( x) ( i j /100) i3 j3 Optimization Neural Network Problem with M=100,000 and N=100. with M=100,000 and N=100. A simple neural network problem was used to benchmark the optimization functions. The network has 20 inputs, 20 perceptrons in a single hidden layer, and a single output. All of the input nodes are connected to all of the perceptrons in the hidden layer. All of the hidden layer perceptrons are connected to the single output node. The network has 441 weights, which are the unknowns in the optimization problem. Each optimization function was used to determine the value of the weights such that the network best fit 10,000 observations; i.e., the network was trained with 10,000 patterns. If the network evaluation function is N(x;w) the optimization problem is 25 www.roguewave.com

N min w i0 1 2 N ( x ; w) y where N = 10,000, x and y are the observations ( training patterns ) and w is the weights. The observations are Full Set of Benchmark Problems y x ij imsl_d_int_fcn_alg_log 1 1/4 3/4 f (x)x ( 1- x) log(x)dx 0 1 imsl_d_int_fcn_cauchy 1 f1( x) dx x 0.5 0 i i cos( 301i 401j) N 1 j0 cos(301i 401j) imsl_d_int_fcn_fourier n1 j sin cos x sin xdx a j0 n where a = 2.5, ω = 3.5, and n = 20,000. imsl_d_int_fcn_hyper_rect 1 1 9 0 0 exp ( xi ) dx9 dx0 where is the normal cumulative distribution function. i0 imsl_d_int_fcn_inf 0 f 2 ( x) dx imsl_d_int_fcn_sing 1 0 f 1 x) ( dx imsl_d_int_fcn_sing_pts 1 0 f ( dx with singular points given as 0.3, 0.5 and 0.7. 1 x) imsl_d_int_fcn_smooth 1 0 f ( dx 1 x) i 26 www.roguewave.com

imsl_d_int_fcn_trig 1 0 f ( dx where ω = 2.5. 1 x)cos( x) imsl_d_fast_poisson_2d 2 2 u u 2x3y The PDE 3u 2sin( x 2y) 16e with the boundary condition 2 2 x y u 2x y 2cos( x 2y) 3e 3 2x3y on the bottom side and u sin( x 2y) e y on the other three sides was solved on a 4,096 by 4,096 grid. imsl_d_constrained_nlp The optimization neural network problem was used with the additional requirement that the weights are bounded to be in [-3,3]. imsl_d_min_con_gen_lin The optimization neural network problem was used with the additional requirement that the weights are bounded to be in [-3,3]. imsl_d_min_uncon_multivar The optimization neural network problem was used. imsl_d_nonlin_least_squares The optimization neural network problem was used. imsl_d_covariances The variance-covariance matrix was computed from a random matrix with 100 observations and 5,000 variables with no missing values. imsl_d_random_poisson Generates 10 8 random numbers with a mean value of 0.5. imsls_d_nonlinear_optimization The optimization neural network problem was used. imsls_d_nonlinear_regression The optimization neural network problem was used. imsls_d_covariances The variance-covariance matrix was computed from an array of 100 observations by 4,000 variables. This array contained random numbers with about 1% set to NaN (missing value indicator). Variances used are computed from the valid pairs of data. The value of the optional argument, IMSLS_MISSING_VALUE_METHOD, was 3. imsls_d_dissimilarities The dissimilarities matrix was computed from an array of 2,000 by 2,000 uniform random numbers, with one added to the diagonal. The Mahalanobis distance method was used. imsls_d_factor_analysis Factor analysis was performed on the 1,000 by 1,000 covariance matrix a 1 ij 0.2 ij i j 1, where is one if i equals j and is zero otherwise. ij 27 www.roguewave.com

imsls_d_cluster_hierarchical The hierarchical clustering was done on a uniform random 10,000 by 10,000 distance matrix. imsls_d_multivariate_normal_cdf The multivariate normal CDF was computed for k = 200 variates. The upper bounds were all equal to 5. The means were i i, for i = 0,, k. The variance-covariance 1 matrix was ij 0.2 ij, where ij is one if i equals j and is zero otherwise. i j 1 imsls_d_random_beta Generates 10 8 random numbers with p = 1 and q = 3. imsls_d_random_exponential_mix Generates 10 8 random with θ 1 = 2, θ 2 = 1, and p = 0.5. imsls_d_random_gamma Generates 10 8 random numbers with shape parameter equal to 1. imsls_d_random_general_continuous Generates 10 8 random numbers from a general continuous distribution, here the beta distribution with p = 3 and q = 2. imsls_d_random_lognormal Generates 10 8 random numbers with a mean value of 3 and a standard deviation of 2. imsls_d_random_normal Generates 10 8 random numbers. imsls_d_random_poisson Generates 10 8 random numbers with a mean value of 0.5. imsls_d_random_triangular numbers Generates 10 8 random numbers. imsls_d_mlff_classification_trainer The Poker Hand data set from the UCI Machine Learning Repository 1 was used. The set is a variation of the set from Robert Cattral and Franz Oppacher 2. The set contains 25,010 observations. Each observation consists of 10 nominal values resulting in 85 inputs to the classification neural network. The observations are classified into 10 classes. Training was done with 10 Stage I epochs, each with 6,000 patterns randomly chosen from the given 25,010 patterns. No Stage II training was performed. imsls_d_genetic_algorithm The n-queens problem was solved with the number of queens n = 100. The problem is to place n queens on an n by n chessboard such that no queen can capture another queen. 1 Asuncion, A. & Newman, D.J. (2007). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. 2 R. Cattral, F. Oppacher, D. Deugo. Evolutionary Data Mining with Automatic Rule Generalization. Recent Advances in Computers, Computing and Communications, pp.296-300, WSEAS Press, 2002. 28 www.roguewave.com

imsls_d_mlff_network_trainer The optimization neural network problem was used. imsls_d_naive_bayes_trainer The Poker Hand dataset was used. This is the same dataset as used for benchmarking imsls_d_mlff_classification_trainer. The Naïve Bayes trainer used all 25,010 observations. Sample Benchmarking Code The following code benchmarks the functions imsls_d_random_normal, imsls_d_genetic_algorithm and imsl_d_fcn_inf. The test cases described above are used. This code requires OpenMP. The standard OpenMP function, omp_set_num_threads, is used to set the number of threads during each phase of the benchmarking. This code assumes the use of a maximum of eight threads. The functions imsl_omp_options and imsls_omp_options are used to indicate that the functions passed as arguments to the IMSL C Library functions are thread-safe. In this sample, these are the functions fcn1d_inf and queensfitness. By default, such user supplied functions are not evaluated in parallel. Each benchmark is run five times to reduce errors in the timings. The code includes a function, wall_clock_time, to measure the actual elapsed ( wall clock ) time in seconds. #include <stdio.h> #include <stdlib.h> #include <math.h> #include <imsl.h> #include <imsls.h> static double benchmark_random_normal(); static double benchmark_int_fcn_inf(); static double fcn1d_inf(double x); static double benchmark_generic_algorithm(); static double queensfitness(imsls_d_individual* individual, int* n_queens); static double wall_clock_time(); int main() { int num_threads[] = {1, 2, 4, 6, 8}; int n_threads = 5; int n_rep = 5; int n_benchmarks = 3; int i_threads, i_rep, i_benchmark; double *times; imsl_omp_options(imsl_set_functions_thread_safe, 1, 0); imsls_omp_options(imsls_set_functions_thread_safe, 1, 0); times = (double*)calloc(n_threads*n_benchmarks, sizeof(double)); for (i_threads = 0; i_threads < n_threads; i_threads++) { omp_set_num_threads(num_threads[i_threads]); for (i_rep = 0; i_rep < n_rep; i_rep++) 29 www.roguewave.com

} { } times[i_threads*n_benchmarks+0] += benchmark_random_normal(); times[i_threads*n_benchmarks+1] += benchmark_generic_algorithm(); times[i_threads*n_benchmarks+2] += benchmark_int_fcn_inf(); } printf("%s %s %s %s\n", "n_threads", "random_normal", "generic_algorithm", "int_fcn_inf"); for (i_threads = 0; i_threads < n_threads; i_threads++) { printf("%5d ", num_threads[i_threads]); for (i_benchmark = 0; i_benchmark < n_benchmarks; i_benchmark++) { printf("%20.2f", times[i_threads*n_benchmarks+i_benchmark]/n_rep); } printf("\n"); } static double benchmark_random_normal() { int n = (int)1.0e8; double time; double *random; } time = wall_clock_time(); random = imsls_d_random_normal(n, 0); time = wall_clock_time() - time; imsls_free(random); return time; static double benchmark_int_fcn_inf() { double time; double result; } time = wall_clock_time(); result = imsl_d_int_fcn_inf(fcn1d_inf, 0.0, IMSL_BOUND_INF, 0); time = wall_clock_time() - time; return time; static double fcn1d_inf(double x) { double sum = 0.0; int i, j; for (i = 3; i < 100000; i++) { for (j = 3; j < 100; j++) { 30 www.roguewave.com

} sum += pow(i+0.01*j, -x); } } return sum; static double benchmark_generic_algorithm() { int i; /* index variables */ int n = 500; /* population size */ int n_generations; /* number of generations */ int n_queens = 100; /* number of nominal phenotypes*/ int n_categories[100]; double maxfit; /* maximum fitness hurdle */ double* genstats; /* generation statistics */ Imsls_d_chromosome* chromosome; /* chromosome data structure */ Imsls_d_individual* best_individual; /* optimum */ Imsls_d_population* population; /* population data structure */ double time; imsls_random_seed_set(12345); /* * Setup Problem */ time = wall_clock_time(); maxfit = n_queens - 0.5; for(i = 0; i < n_queens; i++) { n_categories[i] = n_queens; } chromosome = imsls_d_ga_chromosome( IMSLS_NOMINAL, n_queens, n_categories, 0); population = imsls_d_ga_random_population(n, chromosome, IMSLS_PMX_CROSSOVER, IMSLS_FITNESS_FCN_WITH_PARMS, queensfitness, &n_queens, 0); /* * Solve Problem */ best_individual = imsls_d_genetic_algorithm(null, population, IMSLS_FITNESS_FCN_WITH_PARMS, queensfitness, &n_queens, IMSLS_PMX_CROSSOVER, IMSLS_LINEAR_SCALING, 2.0, IMSLS_CROSSOVER_PROB, 0.7, IMSLS_MUTATION_PROB, 0.01, IMSLS_MAX_GENERATIONS, 10000, IMSLS_MAX_FITNESS, maxfit, IMSLS_GENERATION_STATS, &genstats, IMSLS_N_GENERATIONS, &n_generations, 0); time = wall_clock_time() - time; imsls_d_ga_free_individual(best_individual); imsls_d_ga_free_population(population); 31 www.roguewave.com

} imsls_free(chromosome); return time; static double queensfitness(imsls_d_individual* individual, int* n_queens) { int i, j, k, p; /* Index variables */ int f = 0; /* Fitness value */ int *allele = individual->chromosome->allele; for(i = 0; i < *n_queens-1; i++) { for(j = i+1; j < *n_queens; j++) { k = allele[i] - allele[j]; p = i - j; if (k*k == p*k) f++; } } f = *n_queens - f; return (double)f; } static double wall_clock_time() { #ifdef _WIN32 return clock() / (double)(clocks_per_sec); #else struct timeval tp; gettimeofday(&tp, NULL); return (tp.tv_sec + 1.0e-6*tp.tv_usec); #endif } Conclusion The use of OpenMP directives within the IMSL C Numerical Library is an effective means of achieving scalable parallelism in a portable fashion across a wide variety of computing platforms. While other technologies exist for SMP parallelization, OpenMP is a stable, wellknown and mature choice with broad platform support, which is a key for the IMSL C Numerical Library. Some functions show very good, nearly linear, scaling up to 8 threads, which is quite good considering the algorithms were not re-written, but rather had OpenMP directives added in strategic places. This paper discusses a wide range of algorithms parallelized in IMSL C Numerical Library, but this is not an exhaustive list; please refer to the product documentation for a complete list. Finally, Rogue Wave has performed these benchmarks on standard systems using the publically available version of the software, and while we expect you should get similar results, it is always best to evaluate the algorithms you use on your deployment hardware for truly accurate results. 32 www.roguewave.com

About the IMSL Libraries The IMSL Numerical Libraries are a comprehensive set of mathematical and statistical functions that programmers can embed into their software applications. The libraries save development time by providing pre-written mathematical and statistical algorithms that can be embedded into C, C# for.net, Java and Fortran applications, enhancing return on investment and programmer productivity. The IMSL Libraries can also be used from Python. Beyond choice of programming language, the IMSL Libraries are supported across a wide range of hardware and operating system environments including Windows, Linux, Apple and many UNIX platforms. Go to http://www.roguewave.com/products/imsl-numerical-libraries.aspx to view our product information and links to datasheets, videos, tutorials, evaluation options and more. About the Author Dr. John Brophy, Ph.D. joined the IMSL product development team in 1983. Recently he has been leading in the design and implementation of the parallelization of selected functions in the IMSL C Numerical Library of which he was an original designer. Previously he led in the design and implementation of an object-oriented, native C# and Java, library for mathematics, statistics, finance and charting. He also worked with the Java Grande Forum to make Java a better language for numerical computing. About Rogue Wave Software Rogue Wave Software, Inc. is the largest independent provider of cross-platform software development tools and embedded components for the next generation of HPC applications. Rogue Wave marries High Performance Computing with High Productivity Computing to enable developers to harness the power of parallel applications and multicore computing. Rogue Wave products reduce the complexity of prototyping, developing, debugging, and optimizing multiprocessor and data-intensive applications. Rogue Wave customers are industry leaders in the Global 2000, ISVs, OEMs, government laboratories and research institutions that leverage computationally-complex and data-intensive applications to enable innovation and outperform competitors. For more information, visit http://www.roguewave.com. 33 www.roguewave.com