Vincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012

Size: px

Start display at page:

Download "Vincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012"

Gillian Webb
6 years ago
Views:

1 Vincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012

2 Outline NICS and AACE Architecture Overview Resources Native Mode Boltzmann BGK Solver Native/Offload Mode Poisson Solver Heterogeneous MPI Trapezoidal Rule Conclusions

3 National Institute for Computational Sciences Joint Institute for Computational Sciences University of Tennessee & ORNL Funded by the National Science Foundation (NSF) Operates Kraken, a 1.17 PetaFLOP Cray XT5 that is the NSF s most productive supercomputer Major partner in the NSF s Extreme Science and Engineering Discovery Environment (XSEDE) Managed by UT-Battelle for the U.S. Dept. of Energy

4 Application Acceleration Center of Excellence (AACE) Joint Institute for Computational Sciences University of Tennessee & ORNL Established early in 2011 to investigate the application of future computing technologies to simulation in science and engineering An essential element of a sustainable software infrastructure for scientific computing Director: Glenn Brook Managed by UT-Battelle for the U.S. Dept. of Energy

5 AACE Mission To prepare the national supercomputing community to effectively and efficiently utilize future supercomputing architectures To optimize applications for current and future compute systems To develop expertise in the expression and exploitation of fine-grain and medium-grain parallelism To conduct research and education programs focused on developing and transferring knowledge related to emerging computing technologies To provide expert feedback to HPC vendors to guide the development of future supercomputing architectures and programming models

6 NICS-Intel Strategic Engagement Multi-year agreement with Intel to jointly pursue: Development of next-generation, HPC solutions based on the Intel Many Integrated Core (MIC) architecture Design of scientific applications emphasizing a sustainable approach for both performance and productivity NICS receives early access to Intel technologies and provides application testing, performance results, and expert feedback Help guide further development efforts by Intel Help prepare the scientific community to use future HPC technologies immediately upon their deployment

7 Outline NICS and AACE Architecture Overview Resources Native Mode Boltzmann BGK Solver Native/Offload Mode Poisson Solver Heterogeneous MPI Trapezoidal Rule Conclusions

8 Intel MIC Architecture: An Intel Co-Processor Architecture FIXED FUNCTION LOGIC VECTOR IA CORE COHERENT CACHE COHERENT CACHE VECTOR IA CORE VECTOR IA CORE VECTOR IA CORE INTERPROCESSOR NETWORK COHERENT CACHE COHERENT CACHE INTERPROCESSOR NETWORK VECTOR IA CORE!!!! COHERENT CACHE COHERENT CACHE VECTOR IA CORE VECTOR IA CORE COHERENT CACHE COHERENT CACHE VECTOR IA CORE MEMORY and I/O INTERFACES Many cores, and many, many more threads Standard IA programming and memory model Standard networking protocols Source: Kirk Skaugen, ISC 2010 keynote

Core Speed IO Bus Memory Type Memory Size Peak Flops (Single Precision/ Double Precision) Operating System on Card

9 Intel Knights Ferry Technical Specifications Core Count Up to 32 cores Intel Knights Ferry (Intel KNF) is the software development platform (SDP) for the Intel Many Integrated Core (Intel MIC) architecture. Core Speed IO Bus Memory Type Memory Size Peak Flops (Single Precision/ Double Precision) Operating System on Card Networking Capability Up to 1.2 GHz PCIe Gen2 x16 GDDR5 1 or 2 Gigabytes 1229/153 GFLOPS Linux-based IP-Addressable Image Source: Kirk Skaugen, ISC 2010 keynote

Intel Knights Corner Technical Specifications Current KNC deployments utilize pre-production hardware; thus, current performance does not necessarily indicate that of the commercial product.

10 Intel Knights Corner Technical Specifications Current KNC deployments utilize pre-production hardware; thus, current performance does not necessarily indicate that of the commercial product. Intel Knights Corner (Intel KNC) is the first commercial product employing the Intel Many Integrated Core (Intel MIC) architecture. Core Count IO Bus Operating System on Card Networking Capability > 50 cores PCIe Linux-based IP-Addressable

11 Outline NICS and AACE Architecture Overview Resources Native Mode Boltzmann BGK Solver Native/Offload Mode Poisson Solver Heterogeneous MPI Trapezoidal Rule Conclusions

12 Rook Vendor Intel Configuration CPU Family Codename Number of CPUs 2 Standalone workstation Westmere First Intel MIC SDP deployed by NICS Originally 2 KNF Upgraded to 1 KNC Used to explore porting and optimization techniques CPU cores 12 CPU core speed 1.6 GHz Number of KNC 1

Pawn Vendor Intel Configuration CPU Family Codename Number of CPUs 2 Standalone workstation Sandy Bridge Second Intel MIC SDP deployed by NICS Used to

13 Pawn Vendor Intel Configuration CPU Family Codename Number of CPUs 2 Standalone workstation Sandy Bridge Second Intel MIC SDP deployed by NICS Used to explore porting and optimization techniques for single node applications that utilize more than one KNC. CPU cores 12 CPU core speed 1.6 GHz Number of KNC 2

Bishop Cray CX1 1 st Intel MIC cluster at NICS Interactive Intel MIC demo in the ORNL booth at SC11 Vendor Configuration Nodes CPU model Cray 7u modular enclosure 1 head, 2 compute Intel Xeon

14 Bishop Cray CX1 1 st Intel MIC cluster at NICS Interactive Intel MIC demo in the ORNL booth at SC11 Vendor Configuration Nodes CPU model Cray 7u modular enclosure 1 head, 2 compute Intel Xeon 5670 Westmere CPUs per node 2 Cores per CPU 6 CPU core speed RAM per node Intel KNFs per compute node 2.93 GHz 24 GB Cores per Intel KNF 32 Intel KNF core speed RAM per Intel KNF GHz 2 GB

15 Beacon Appro Cluster Funded by the NSF, Beacon is used as a resource for porting and optimizing important scientific codes to the Intel MIC architecture. Beacon will be upgraded to KNCs when they become available. In Summer 2012, an open call will go out to invite research teams to propose projects for work on Beacon. For further information about the Beacon project, please contact PI Glenn Brook (glennbrook@tennessee.edu). Vendor Configuration Nodes Appro 1 42U enclosed rack 2 service, 16 compute CPU model Intel Xeon E CPUs per node 2 Cores per CPU 8 CPU core speed RAM per node Intel KNFs per compute node 2.6 GHz 64 GB (service) 24 GB (compute) Cores per Intel KNF 30 or 32 RAM per Intel KNF 2 2 GB

16 Offload Mode Implementation Offload mode is a very similar operation to the current paradigm offered by GPGPUs. The serial portion of the code is run on the CPU, and portions of the code are either explicitly or implicitly offloaded to the Intel MIC coprocessor. These offload-mode results presented here were obtained using implicit offloads.

17 Native Mode Implementation In this mode, the code and input deck is transferred to the Intel MIC coprocessor, and all computations are done locally. While this saves the overhead of offloading data, the individual core speed of the Intel MIC coprocessor is slower than that of the Intel Xeon processor, which makes the serial portion of the code more significant. Simply compile with mmic and copy the code over for execution.

18 Symmetric Mode Implementation In this mode, the Xeon Phi coprocessor and the Xeon processor are invoked as equals with MPI ranks being distributed amongst cores. A corollary is asymmetric mode, wherein the workshare is not evenly distributed. Intel MPI distrbutes the ranks among the cores through command line arguments. This can also be hybrid OpenMP and MPI.

19 Outline NICS and AACE Architecture Overview Resources Native Mode Boltzmann BGK Solver Native/Offload Mode Poisson Solver Heterogeneous MPI Trapezoidal Rule Conclusions

20 Boltzmann-BGK Solver Point-iterative Newton-Jacobi scheme that employs dual numbers to compute Jacobian contributions directly via Taylor series expansions in the dual space Currently: Single node, OpenMP Optimized by Rob Van Der Wjingaart at Intel Baseline: all major loops are vectorized except for one Known problem: template functions that return values into constructed stack variables More exposed concurrency than available threads Performance results obtained with 2 OpenMP threads per core are presented relative to the above baseline solver

Test on the Intel MIC Architecture Simulate a Couette flow Gas is initially at rest between two infinitely long parallel plates Left plate is stationary while the right plate moves Gas settles into a

21 Test on the Intel MIC Architecture Simulate a Couette flow Gas is initially at rest between two infinitely long parallel plates Left plate is stationary while the right plate moves Gas settles into a steady state solution Verifies a solver s ability to handle solid surfaces and moving boundary conditions u wall = 0m/s T wall = 273.0K Couette Flow! 0 = 9.28!10 "8 kg/m 3 u x0 = 0.0m/s u y0 = 0.0m/s T 0 = 273.0K Kn =1.199 Steady state flow problem using the BGK model Boltzmann equation u wall = 300m/s T wall = 273.0K

22 Solutions Generated by the Intel MIC Coprocessor Steady-state solution of a Couette flow using the Boltzmann equation with BGK collision approximation

23 Optimizations Set I Loop Vectorization Stack variable pulled out of the loop Class member turned into a regular structure Set II Data Access Arrays linearized using macros Align data for more efficient access Set III Parallel Overhead Reduce the number of parallel sections

24 Optimizations Set IV Dependancy Remove reduction from computational loop by saving value into a private variable Set V Precision Use medium precision for math function calls (- fimf-precision=medium) Set VI Precision Use single precision constants and intrinsics Set VII Compiler Hints Use #pragma SIMD instead of #pragma IVDEP

25 Optimization Results Relative Speedup KNC-balanced KNC-scatter Single Precision Loop Vectorization 1 0

26 Scalability of Optimization Set VII 64 (100, 51.1) 32 Parallel Speedup Optimization Set VII - balanced Optimization Set VII - scatter OpenMP Threads

27 Scalability of Optimization Set VII Larger Problem Size Parallel Speedup Scatter OpenMP Threads

28 Boltzmann- BGK Parallel Code Fraction parallel fraction number of cells in each velocity dimension

29 Boltzmann BGK Solver Summary Boltzmann-BGK solver ported and optimized Great strong scaling results across all cores Full vectorization vital for max performance Single precision constants and intrinsics free performance boost Next steps Offload vs. native MPI + OpenMP Scaling across multiple cards Scaling across multiple compute nodes

30 Outline NICS and AACE Architecture Overview Resources Native Mode Boltzmann BGK Solver Native/Offload Mode Poisson Solver Heterogeneous MPI Trapezoidal Rule Conclusions

31 Heat transfer case information Heat transfer over a flat plate to a cool boundary Uses Poisson s Equation (! 2 " = f ) xe y x : 0! 2 y : 0! 1 Source term is Size:, Three structured meshes 600x600, 1200x1200, 2400x2400 Symmetric Gauss-Seidel with SSOR* Decomposed to 1, 15, 30, 60, 90, 120 threads offloaded from one processor *In non-blocking parallel mode, there are some Jacobi iterations

32 Flat Plate Results Initial and Final plots of Temperature Contours (using VisIt) in C

33 Speedup for Offload Mode on KNC

34 Speedup for Native Mode on KNC

35 Summary for Poisson Solver Larger matrices see better speed up Offload time amortized into benefit of multiple threads Native mode is slower as serial fraction increases due to Intel Xeon processor core speed versus Intel MIC coprocessor core speed. Eventually, will use mixed offload and native mode to map to the Intel MIC architecture

36 Outline NICS and AACE Architecture Overview Resources Native Mode Boltzmann BGK Solver Native/Offload Mode Poisson Solver Heterogeneous MPI Trapezoidal Rule Conclusions

Uses of Heterogeneous Mode Have next time step being calculated on the coprocessor(s) while the post-processing and visualization is done on the processor(s) Run border cells on the

37 Uses of Heterogeneous Mode Have next time step being calculated on the coprocessor(s) while the post-processing and visualization is done on the processor(s) Run border cells on the processor(s) while internal cell calculations are run without communication need on the coprocessor(s) Run larger partitions on the processor(s) and smaller partitions on the coprocessors

38 Heterogeneous Mode Trapezoidal Rule Integration Trapezoidal Rule is a method of integration n ( f (x) + f (x +!x)) " #!x 2 i=1 Successively higher discretization results in more accurate answer Embarrassingly parallel, except in summation * %21Trapezoidal_rule_illustration.png

39 Heterogeneous Mode Trapezoidal Rule Integration Attempting to determine correct balance of processor to coprocessor to get best speed up Wait times let user know if computational power is being used to its utmost

40 Heterogeneous Mode Trapezoidal Rule Integration Speed up is similar for both implementations, but there are many wasted clock cycles in the poorly load balanced version of the code.

41 Outline NICS and AACE Architecture Overview Resources Native Mode Boltzmann BGK Solver Native/Offload Mode Poisson Solver Heterogeneous MPI Trapezoidal Rule Conclusions

42 Conclusions Native mode For easy porting For extremely parallel codes that run within the memory available on coprocessors Offload mode For codes with a small number of highly parallel sections that dominate the computation For hybrid MPI-OpenMP codes Heterogeneous mode For networked peers like the majority of MPI-based codes today, providing a migration path for many codes For complete control of computation and communication

43 Contact Vincent Betro, Ph.D. Computational Scientist, Application Acceleration Center of Excellence National Institute for Computational Sciences R. Glenn Brook, Ph.D. Director, Application Acceleration Center of Excellence National Institute for Computational Sciences Ryan Hulguin Intern, Application Acceleration Center of Excellence National Institute for Computational Sciences

Ryan C. Hulguin TACC-Intel Highly Parallel Computing Symposium April 10th-11th, 2012 Austin, TX

Ryan C. Hulguin TACC-Intel Highly Parallel Computing Symposium April 10th-11th, 2012 Austin, TX Outline Introduction Knights Ferry Technical Specifications CFD Governing Equations Numerical Algorithm Solver