Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection

1 Numerical Libraries in the DOE ACTS Collection
The DOE ACTS Collection
SIAM Parallel Processing for Scientific Computing, Savannah, Georgia, Feb 15, 2012
Tony Drummond, Computational Research Division, Lawrence Berkeley National Laboratory

2 The DOE ACTS Collection
Project Goal: The Advanced CompuTational Software Collection (ACTS) makes reliable and efficient software tools more widely used, and more effective in solving the nation's engineering and scientific problems.
Tony Drummond and Osni Marques, Computational Research Division, Lawrence Berkeley National Laboratory

3 What is the Role of ACTS in the HPC Software Stack?
Stack layers: APPLICATIONS, GENERAL PURPOSE TOOLS, PLATFORM SUPPORT TOOLS AND UTILITIES, HARDWARE

4 ACTS Plays a Critical Role in the HPC Software Stack
Accelerate application code development by maintaining a solid collection with some of the best numerical kernels and support tools for code development, run-time and library optimization.
Stack layers: APPLICATIONS, GENERAL PURPOSE TOOLS, PLATFORM SUPPORT TOOLS AND UTILITIES, HARDWARE
ACTS tools: ScaLAPACK, PETSc, Overture, SLEPc, Global Arrays, TAO, AztecOO, PyACTS, SuperLU, TAU, Hypre, ATLAS, SUNDIALS

5 The DOE ACTS Collection
Category / Tool / Functionalities

Numerical
  AztecOO: Scalable linear and non-linear solvers using iterative schemes.
  Hypre: A family of scalable preconditioners.
  PETSc: Scalable linear and non-linear solvers and additional support for PDE related work.
  OPT++: Object-oriented nonlinear optimization solvers.
  SUNDIALS: Solvers for the solution of systems of ordinary differential equations, nonlinear algebraic equations, and differential-algebraic equations.
  ScaLAPACK: High performance parallel dense linear algebra.
  SLEPc: Scalable algorithms for the solution of large sparse eigenvalue problems.
  SuperLU: Scalable direct solution of large, sparse, nonsymmetric linear systems of equations.
  TAO: Large-scale optimization software.
Code Development
  Global Arrays: Supports the development of parallel programs.
  Overture: Supports the development of computational fluid dynamics codes in complex geometries.
Run Time Support
  TAU: Portable and scalable performance analysis and tracing tools for C, C++, Fortran and Java programs.
Library Development
  ATLAS: Automatic generation of optimized numerical dense algebra for scalar processors.

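6 Numerical Functionality in the ACTS Collection
Problem classes covered:
- Linear systems: $Ax = b$ or $AX = B$, and $Hx = b$
- Least squares and minimum norm: $\min_x \|b - Ax\|_2$, $\min_x \|x\|_2$
- Eigenvalue problems: $Az = \lambda z$
- Singular value decomposition: $A = U \Sigma V^T$, $A = U \Sigma V^H$
- Generalized eigenvalue problems: $Az = \lambda Bz$, $ABz = \lambda z$, $BAz = \lambda z$
Commonalities among ACTS Tools:
- General purpose user interfaces
- Parallel and scalable implementations of numerical algorithms
- Modular design (kernel reusability)
- Parallelism exploited at the MPI_TASK level (newer versions under development to support other levels of concurrency)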

7 Providing Sustainable and Scalable Performance for ACTS Tools in Multicore Systems
TOOL DEVELOPERS and APPLICATION DEVELOPERS: the challenge is to avoid code rewrites for performance.
CS community efforts: hand-tuned codes, compiler optimization, auto-tuning, new programming environments

8 Providing Sustainable and Scalable Performance for ACTS Tools in Multicore Systems
TOOL DEVELOPERS
- Development of new numerical functionalities and implementation
- Integration of auto-tuned kernels (BLAS, LAPACK, TORCH, etc.)
- Adoption of new programming models and paradigms
APPLICATION DEVELOPERS
- Tool and functionality selection
- Choosing functional parameters
- Compile/Link: integrating optimized kernels
- Runtime: dynamic kernel selection
- Verification: robustness and scalability

9 Providing Sustainable and Scalable Performance for ACTS Tools in Multicore Systems Q. Can Performance Scalability be passed from libraries/tools to applications and across platforms and configurations?

10 Compiler Optimized vs Highly Tuned LU Solve (ScaLAPACK)
[Figure: % improvement versus number of MPI_TASKS (np), CRAY - XTE6. Caption: Execution time improvement vs. performance scalability]

11 Portable Performance is no longer straightforward
Doubling Problem Size and Adding 1 node
[Table: NP versus Threads/MPI_TASK for 1(24): NP 4, 8, 16, 24 with 6, 3, 1, 1 threads per MPI_TASK; values for 2(48) not recoverable]
[Figure: performance versus Total MPI Tasks]
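The "1(24)" and "2(48)" labels suggest nodes with 24 cores each. A minimal sketch of how the thread count per MPI task could be derived from that, assuming threads simply fill the cores left over by the tasks on a node. The function name and the floor-division rule are illustrative assumptions, not something stated on the slides.

```python
def threads_per_task(cores_per_node: int, tasks_per_node: int) -> int:
    """Threads each MPI task can use without oversubscribing one node.

    Assumption: cores are split evenly across tasks, any remainder is
    left idle (floor division), and every task gets at least one thread.
    """
    if cores_per_node < 1 or tasks_per_node < 1:
        raise ValueError("core and task counts must be positive")
    return max(1, cores_per_node // tasks_per_node)


if __name__ == "__main__":
    # 24 cores per node is inferred from the "1(24)" label.
    for tasks in (4, 8, 16, 24):
        print(tasks, threads_per_task(24, tasks))
```

Under these assumptions, 4, 8, 16 and 24 tasks on one 24-core node get 6, 3, 1 and 1 threads respectively; with 16 tasks, 8 cores stay idle.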

12 Portable Performance is no longer straightforward
Triple the Problem Size and 3 nodes
[Table: NP versus Threads/MPI_TASK for 1(24) and 3(72); values only partially recoverable]
[Figure: performance versus Total MPI Tasks]

13 Providing Sustainable and Scalable Performance for ACTS Tools in Multicore Systems
Q. Can Performance Scalability be passed from libraries/tools to applications and across platforms and configurations?
A. Yes, maybe with a lot of automatic work
- Fully operational through various parameters and levels of automation
- Application developers are very reluctant to change their code, so preserve the current structure of APIs as much as possible

14 ACTS Parametric Research and Collaborations
Use of ACTS parameters to ensure application scalability (pACTS)
Stages: Pre-Installation, Library Installation, Compile + link Application, Run Time job submit options
Stack layers: APPLICATIONS, GENERAL PURPOSE TOOLS, PLATFORM SUPPORT TOOLS AND UTILITIES, HARDWARE

15 ACTS Parametric Research and Integration
Without pACTS
- Hand-tuning algorithmic parameters can be cumbersome
- Auto-tuning produces a single tuned library (n = max-cores/node)
- Some applications won't scale
With pACTS
- Auto-tune algorithmic parameters (smart-tuning)
- Auto-tuning produces multiple tuned libraries using steering parameters (#cores/node)
- Run-time selection of tuned executables (#cores/node)
- Sustainable and scalable performance for all applications
Stack layers: APPLICATIONS, GENERAL PURPOSE TOOLS, PLATFORM SUPPORT TOOLS AND UTILITIES, HARDWARE

16 Multi-Level Tuning to Attain Scalable Performance
Optimized dense BLAS kernels
Algorithmic optimization
- Minimize computational costs (storage + ops)
- Sustain numerical stability and reliability
- Specialized problem solving techniques
Software Implementations:
- Specialized data structures
- Maximize load balancing
- Minimize latencies, idle time, etc.
Tuning labels shown, in order: Auto-tuning, Smart-tuning, Auto-tuning
Stack layers: APPLICATIONS, GENERAL PURPOSE TOOLS

17 Tuning at Library Installation Level
Software Resources:
- Compiler level optimizations
- Specialized communication libraries and other custom-made support libraries
- Auto-tuners
ACTS PARAMETERS
Performance Tuning Parameters (PT-pACTS) and Software Dependencies (SD-pACTS):
- arithmetic and arithmetic precision
- automatic threading
- compiler
- communication libraries and paradigms
- software requirements
Stack layers: APPLICATIONS, GENERAL PURPOSE TOOLS, PLATFORM SUPPORT TOOLS AND UTILITIES, HARDWARE

18 Parametrize Optimized Installation of Libraries+Apps
Software Resources:
- Auto-tuners
- Performance Monitors
- Functional Performance Parameter Derivation (FP-pACTS)
ACTS PARAMETERS
PT-pACTS, SD-pACTS:
- NUMA aware, thread, cache, TLB and local store blocking, padding, register and format selection
FP-pACTS:
- Output of performance monitoring
- Optimized library and kernels labeling
Stack layers: APPLICATIONS, GENERAL PURPOSE TOOLS, PLATFORM SUPPORT TOOLS AND UTILITIES, HARDWARE

19 Runtime Dynamic Selection of Kernels/Libraries
Software Resources:
- Runtime scripts (e.g., job submission scripts)
- Runtime Parameters impacting application performance (RT-pACTS)
- Functional Performance Parameter Derivation (FP-pACTS)
ACTS PARAMETERS
- Smart-tuning tools
- Choice of ACTS tool(s) and functionality
- Choice of calling parameters (RT-pACTS)
PT-pACTS, SD-pACTS, FP-pACTS and RT-pACTS:
- algorithmic (functional calls)
- application numerical requirements
- problem size
- resource utilization
Stack layers: APPLICATIONS, GENERAL PURPOSE TOOLS, PLATFORM SUPPORT TOOLS AND UTILITIES, HARDWARE

20 Simple Example of ScaLAPACK LU
Library Installation Time
SD-pACTS:
- PDGETRF and PDGETRS implementations
- BLAS DGEMM, DTRSM implementations
PT-pACTS:
- Number of cores/node
ScaLAPACK software hierarchy (Global / Local): ScaLAPACK, PBLAS, LAPACK, BLACS, BLAS (platform specific), MPI/PVM/...
OUTPUT: Tuned kernels
  PBLAS_DEFAULT   BLACS_DEFAULT
  PBLAS_A01V1     BLACS_A01V1
  PBLAS_A01V2     BLACS_A01V2
  PBLAS_A01V3     BLACS_A01V3
  :               :
  PBLAS_A01Vn     BLACS_A01Vn

21 Simple Example of ScaLAPACK LU
Application Code Link Time
SD-pACTS:
- PDGETRF and PDGETRS implementations
- BLAS DGEMM, DTRSM implementations
FP-pACTS:
- MPI_TASKS/node
- Blocking factor
Kernel selector
OUTPUT: Tuned kernels
  PBLAS_DEFAULT   BLACS_DEFAULT
  PBLAS_A01V1     BLACS_A01V1
  PBLAS_A01V2     BLACS_A01V2
  PBLAS_A01V3     BLACS_A01V3
  :               :
  PBLAS_A01Vn     BLACS_A01Vn
Stack layers: APPLICATIONS, GENERAL PURPOSE TOOLS, PLATFORM SUPPORT TOOLS AND UTILITIES, HARDWARE
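The kernel selector can be pictured as a lookup from an FP-pACTS value to one of the labelled PBLAS/BLACS pairs, falling back to the default pair when no tuned variant matches. The sketch below is only an illustration: the slides give the labels, but which MPI_TASKS/node value each label was tuned for is a made-up assumption here, as are the names `TunedKernels` and `select_kernels`.

```python
from typing import Dict, NamedTuple


class TunedKernels(NamedTuple):
    """A matched pair of tuned library labels."""
    pblas: str
    blacs: str


DEFAULT = TunedKernels("PBLAS_DEFAULT", "BLACS_DEFAULT")

# Hypothetical assignment of labels to MPI_TASKS/node values.
# The labels come from the slide; the keys are invented for illustration.
TUNED_FOR_TASKS_PER_NODE: Dict[int, TunedKernels] = {
    2: TunedKernels("PBLAS_A01V1", "BLACS_A01V1"),
    8: TunedKernels("PBLAS_A01V2", "BLACS_A01V2"),
    16: TunedKernels("PBLAS_A01V3", "BLACS_A01V3"),
}


def select_kernels(tasks_per_node: int) -> TunedKernels:
    """Return the pair tuned for this MPI_TASKS/node, else the default pair."""
    return TUNED_FOR_TASKS_PER_NODE.get(tasks_per_node, DEFAULT)


if __name__ == "__main__":
    print(select_kernels(16))
    print(select_kernels(5))
```

Keeping PBLAS and BLACS labels in one record avoids mixing a PBLAS variant with a BLACS variant that was tuned for a different configuration.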

22 Example of ScaLAPACK LU
Best in Node Performance
SD-pACTS:
- PDGETRF and PDGETRS implementations
- BLAS implementation: ACML
FP-pACTS:
- Matrix 2D blocking
- Process grid
RT-pACTS:
- 16x16
- Number of cores and number of nodes
[Table: Problem size, Matrix Blocking, Process Grid, % of Peak, Total MPI Tasks; values not recoverable]
[Figure: performance versus Total MPI Tasks, series "Using RT-pACTS" and "Without RT-pACTS"]
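The process grid is listed as a tunable parameter, but the slides do not say how it is chosen. One common heuristic for 2D-distributed dense factorizations is to pick the most nearly square grid that uses every MPI task. The sketch below shows that heuristic only; it is an assumption, not the selection rule used in the ACTS work, and `process_grid` is a hypothetical name.

```python
import math
from typing import Tuple


def process_grid(total_tasks: int) -> Tuple[int, int]:
    """Most nearly square (rows, cols) grid with rows * cols == total_tasks.

    rows is the largest divisor of total_tasks not exceeding its square
    root, so rows <= cols always holds.
    """
    if total_tasks < 1:
        raise ValueError("total_tasks must be positive")
    rows = math.isqrt(total_tasks)
    while total_tasks % rows != 0:
        rows -= 1
    return rows, total_tasks // rows


if __name__ == "__main__":
    for tasks in (4, 8, 16, 24):
        print(tasks, process_grid(tasks))
```

A prime task count degenerates to a 1 x P grid under this rule, which is one reason a real selector would also weigh measured performance.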

23 Example of ScaLAPACK LU
Best in Node Performance
SD-pACTS:
- PDGETRF and PDGETRS implementations
- BLAS implementation: optimized kernels, 2-cores
FP-pACTS:
- Matrix 2D blocking
- Process grid
RT-pACTS:
- Number of cores and number of nodes
- MPI_TASKS/node
[Table: Problem size, Matrix 2D Blocking (NB=8), Process Grid, % of Peak, Total MPI Tasks; remaining values not recoverable]
[Figure: performance versus Total MPI Tasks, series "Using RT-pACTS" and "Without RT-pACTS"]

24 Example of ScaLAPACK LU
Best in Node Performance
SD-pACTS:
- PDGETRF and PDGETRS implementations
- BLAS implementation: optimized kernels, 16-cores
FP-pACTS:
- Matrix 2D blocking
- Process grid
RT-pACTS:
- Number of cores and number of nodes
- 16 MPI_TASKS/node
[Table: Problem size, Matrix 2D Blocking (NB=8), Process Grid, % of Peak, Total MPI Tasks; remaining values not recoverable]
[Figure: performance versus Total MPI Tasks, series "Using RT-pACTS" and "Without RT-pACTS"]
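The blocking factor NB and the process grid together determine which process owns each matrix entry in ScaLAPACK's 2D block-cyclic layout. A minimal sketch of that mapping, assuming 0-based global indices, square NB x NB blocks, and the first block placed on process (0, 0); the function name `owner` is hypothetical.

```python
from typing import Tuple


def owner(i: int, j: int, nb: int, grid_rows: int, grid_cols: int) -> Tuple[int, int]:
    """Grid coordinates of the process owning global entry (i, j).

    Blocks of nb consecutive rows (columns) are dealt out to process
    rows (columns) in round-robin order, starting at process (0, 0).
    """
    if min(i, j) < 0 or min(nb, grid_rows, grid_cols) < 1:
        raise ValueError("indices must be >= 0 and sizes must be >= 1")
    return (i // nb) % grid_rows, (j // nb) % grid_cols


if __name__ == "__main__":
    # NB=8 as on the slide; the 2 x 2 grid is an illustrative choice.
    print(owner(0, 0, 8, 2, 2))
    print(owner(8, 0, 8, 2, 2))
    print(owner(16, 9, 8, 2, 2))
```

A small NB spreads work more evenly across processes but shrinks the blocks handed to the local BLAS, which is the trade-off that makes the blocking factor worth tuning.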

25 Concluding Remarks
- HPC centers vs. installation on your laptop
- Ongoing work in parametric research
- TORCH kernels
- Parameter derivation and selection
- S. Petiton and C. Calvin
- Current tests used an older version of OSKI, hand-tuned kernels and ACML (BLAS)
- Enlarge the set of auto-tuners
- Incorporate new ACTS tool developments
