Big Data Analytics Performance for Large Out-Of-Core Matrix Solvers on Advanced Hybrid Architectures

Procedia Computer Science, Volume 51, 2015, Pages 2774-2778. ICCS 2015 International Conference on Computational Science. doi:10.1016/j.procs.2015.05.432

Raghavendra Shruti Rao, Dr. Milton Halem and Dr. John Dorband
University of Maryland, Baltimore County (UMBC), Maryland, U.S.A.
rrao1@umbc.edu, halem@umbc.edu, dorband@umbc.edu

Abstract

This paper examines the performance of advanced computer architectures for large Out-Of-Core matrices in order to assess optimal Big Data system configurations. The performance evaluation is based on a large dense Lower-Upper Matrix Decomposition (LUD) employing a highly tuned, I/O managed, slab-based LUD software package developed by the Lockheed Martin Corporation. We present extensive benchmark studies conducted with this package on UMBC's Bluegrit and Bluewave clusters and on NASA-GSFC's Discover cluster. Our results show that the single-node speedup achieved by the Phi coprocessors relative to the host Sandy Bridge CPU is about 1.5x, a smaller relative performance gain than in the studies by F. Masci, who obtains a 2-2.5x improvement. Surprisingly, the Westmere with the Tesla GPU scales comparably with the Sandy Bridge and the Phi coprocessor up to 12 processes and then fails to continue to scale. Across 20 CPU nodes, Sandy Bridge obtains a uniform speedup of 0.5X over Westmere for problem sizes of 10K, 20K and 40K unknowns. With an InfiniBand DDR interconnect, the performance of the Nehalem processors is comparable to Westmere without the interconnect.

Keywords: Matrix Multiplication, Out-Of-Core Matrices, Hybrid, Phi Co-Processors, Tesla GPU

1 Introduction

Matrices form the basis of Big Data Analytics involving stock control, cost, revenue and profit calculations. However, when these dense matrices grow so large that they cannot fit in the main memory of computer clusters, one must either change the mathematical algorithm or use secondary memory (e.g., disk). As a result, new methods for managing the I/O flow of these data are being developed which, along with advances in hybrid computing architectures, offer promising capabilities for addressing the challenges of Big Data analytics.
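To make the out-of-core approach concrete, the sketch below keeps a large dense matrix on disk and touches it one slab (block of columns) at a time, so that only a small working set is ever resident in main memory. This is a minimal Python/NumPy illustration of the general idea, not the Lockheed Martin package evaluated in this study; the file name, slab width, and random fill are arbitrary choices made for illustration.

import numpy as np

N, SLAB_COLS = 10_000, 327                     # 10K unknowns; slab width chosen for illustration
A = np.memmap("matrix_A.bin", dtype=np.float64, mode="w+", shape=(N, N))

# Fill the matrix slab by slab so the full array never has to fit in RAM.
rng = np.random.default_rng(0)
for j0 in range(0, N, SLAB_COLS):
    j1 = min(j0 + SLAB_COLS, N)
    A[:, j0:j1] = rng.random((N, j1 - j0))     # only one slab is resident at a time
A.flush()                                      # push dirty pages back to disk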

Performance evaluation is of primary interest when computer systems are designed, selected, or improved (Ferrari, 1972). We provide performance evaluations on the various HPC platforms available to the NSF-sponsored Center for Hybrid Multicore Productivity Research at UMBC, as well as those at the NASA Center for Climate Simulation at the Goddard Space Flight Center (GSFC). We employ a highly tuned, I/O managed, slab-based LU Decomposition software package developed by the Lockheed Martin Corporation for systems employing MICs and GPUs. We perform our evaluations using Lower-Upper (LU) factorization, commonly referred to as the decomposition version of Gauss elimination, the most popular method for solving large dense systems of equations.

2 Related Work

Dongarra et al. (2013) present several fundamental dense linear algebra (DLA) algorithms for multicore systems with Xeon Phi coprocessors and give an overview of MAGMA MIC (MAGMA MIC 1.2 Beta, 2014). F. Masci performs a preliminary benchmarking of a single Xeon Phi card in (Masci, 2013). Two of the three tests used for benchmarking dealt with multiplication of two large matrices of size 3000 x 3000. One implementation used explicit looping over matrix elements, with the loops multithreaded using OpenMP pragmas and run in native-MIC, MIC-offload, and host-exclusive modes. The other implementation used the highly Phi-optimized Math Kernel Library (MKL) to perform the matrix calculations and OpenMP to assist with multithreading. His results show that the Xeon Phi gained a speedup factor of 2-2.5 compared with an Intel Xeon E5 processor with 16 2.6 GHz cores. He opines that this speedup is not worth the cost if the application spends most of its time moving data off/onto disk and/or in/out of RAM. Nonetheless, he concurs that some scientists may benefit from the Xeon Phi by running compute-intensive customized code, e.g., N-body simulations.

3 System Architecture Overview

We performed our evaluations on three different systems. First, we chose the Linux cluster Discover at the NASA Center for Computational Sciences (NCCS), an IBM iDataPlex with 480 Intel Many Integrated Core (Phi) coprocessors. Each host node consists of dual oct-core 2.6 GHz Intel Xeon Sandy Bridge processors. The interconnect is quad data rate (QDR) InfiniBand. Discover also has a set of nodes configured with Graphics Processing Units (GPUs). We ran our test on a single Tesla GPU (M2070) with a PCIe x16 Gen2 system interface, hosted on two hex-core 2.8 GHz Intel Xeon Westmere processors. The interconnect used in this SCU is dual data rate (DDR) InfiniBand.

Second, the Bluegrit cluster hosted at the University of Maryland, Baltimore County (UMBC) is a distributed-memory cluster consisting of 47 PowerPC nodes, one Power 7 node, 13 Intel Nehalem blades and 7 Intel Westmere blades. We used the following nodes: a single Intel Xeon E5504 (Nehalem) running at 2.00 GHz; a single Intel Xeon X5670 (Westmere) running at 2.93 GHz; and an IBM blade cluster with dual 6-core Intel Xeon X5670 (Westmere) blades with 24 GB of system memory and an attached NVIDIA Tesla M2070 GPU, which has 448 cores and 6 GB of graphics memory.

Finally, the Linux-based distributed-memory cluster Bluewave, also hosted at UMBC, consists of 160 IBM iDataPlex compute nodes, each with two quad-core Intel Xeon Nehalem X5560 processors running at 2.80 GHz. InfiniBand DDR is used as the interconnect.
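For reference, the kernel being benchmarked on all of these systems is a dense LU factorization followed by triangular solves against the right-hand sides. The minimal in-core sketch below (Python/SciPy, with an illustrative problem size far smaller than the 10K-40K runs reported here) shows that baseline computation; it is a stand-in for discussion only, not the Fortran 95 SlabLUD package itself.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

n, nrhs = 2_000, 32                      # illustrative sizes only
rng = np.random.default_rng(1)
A = rng.random((n, n))                   # dense LHS matrix A
B = rng.random((n, nrhs))                # block of right-hand sides

lu, piv = lu_factor(A)                   # LU decomposition (Gauss elimination with partial pivoting)
X = lu_solve((lu, piv), B)               # forward/back substitution for every RHS column

print("max residual:", np.abs(A @ X - B).max())   # correctness check, analogous to matrix_solve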

4 Benchmark Application

Slab Lower-Upper Decomposition (SlabLUD), the code used for this study, is a research code maintained by the Lockheed Martin Corporation (LMCO) Centralized Super Computer Facility (CSCF). It is written in Fortran 95 and comprises three separate binaries: matrix_fill, matrix_decomp, and matrix_solve. The binary matrix_fill writes a matrix filled with random values to disk using the I/O library calls. The binary matrix_decomp factors the LHS matrix A using LU decomposition and stores the factors on disk, while the binary matrix_solve checks the correctness of the solution. During the solve phase, the solution vectors overwrite the RHS, that is, matrix B.

When the data are too big to fit in main memory, the I/O library is used to store the matrices and right-hand sides (RHS) on disk in slab format, and a triple buffer is used to improve out-of-core performance. The triple-buffer system overlaps disk I/O with computation: while two slabs are being operated on, the next slab is being transferred from disk. A GPU version of the SlabLUD code was developed by T. Blattner (Blattner, 2013). He extends the existing I/O algorithm to also buffer data from compute-unit memory to accelerator memory (through PCI Express), allowing the algorithm to seamlessly utilize accelerator technology and its available GPU library support.

5 Performance Results

Figure 5.1: Average compute time per processor on one node, for a 10K x 10K matrix

Figure 5.1 above shows the average compute time per processor for 10K unknowns with a slab size of 100 MB on the various architectures. The dimensions of the right-hand-side matrix (B) are the same as those of a slab, typically 100 MB or 327 columns for 10,000 unknowns. The two dots in the top left corner indicate the time taken by the GPUs. Although both GPU computing modules were the same, the Tesla M2070 on Bluegrit took 448 seconds as against Discover's 764 seconds. This increased Bluegrit GPU performance may be due to the fact that Bluegrit's IBM/GPU blade is connected directly to the Westmere CPU blade. The Nehalem processor on Bluewave is 30% faster than the one on Bluegrit. Further, the run on NASA's Westmere is 4 times faster than on Bluegrit. This is made possible by the presence of InfiniBand DDR as the interconnect in the Bluewave and Discover systems. The Xeon E5-2670 Sandy Bridge processor and the Phi coprocessors deliver 2.5x and 3x-5x speedups, respectively, over Westmere. However, it was noted that automatic offload does not take place unless the slab size is increased to 4000 columns, or 1220 MB. A speedup of 1.5X is observed when the Phi coprocessors are compared to the host Sandy Bridge, which is slightly less than the speedup observed by Masci in (Masci, 2013). That the Phi coprocessor outperforms the Fermi-architecture GPU in terms of computing power indicates that the MKL library is far more optimized and well suited for computationally intensive problems.
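The triple-buffer scheme described in Section 4 can be pictured as a small producer/consumer pipeline: a background thread prefetches the next slab from disk into a spare buffer while the main thread works on the slab it already holds. The Python sketch below shows only this overlap idea, with a bounded queue standing in for the buffer pool and a placeholder reduction standing in for the per-slab factorization work; it assumes the matrix_A.bin file from the earlier sketch and is not the Fortran 95 SlabLUD I/O library.

import numpy as np
import queue
import threading

N, SLAB_COLS = 10_000, 327
A = np.memmap("matrix_A.bin", dtype=np.float64, mode="r", shape=(N, N))

slabs = queue.Queue(maxsize=2)             # at most two slabs in flight at any time

def prefetch():
    # Producer: stream slabs from disk while the consumer computes on earlier ones.
    for j0 in range(0, N, SLAB_COLS):
        j1 = min(j0 + SLAB_COLS, N)
        slabs.put(np.array(A[:, j0:j1]))   # copying the slab forces the disk read
    slabs.put(None)                        # sentinel: no more slabs

threading.Thread(target=prefetch, daemon=True).start()

checksum = 0.0
while (slab := slabs.get()) is not None:
    checksum += slab.sum()                 # placeholder for the per-slab compute phase
print("checksum:", checksum)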

Figure 5.2: Scalability analysis of different processors on one node, using a 10K x 10K matrix

When the results are analyzed for scalability as the number of processes increases (Figure 5.2), it can be observed that the Phi coprocessors and Sandy Bridge scale very close to the ideal behavior. The two Nehalem curves (bottom two lines) display very similar performance, but with increased scalability on Bluewave. The same observation can be made about Westmere (the two curves at the center), where the scalability factor is 0.6-0.75x higher on Discover.

Figure 5.3 below shows the breakdown of the performance of 20 Westmere nodes for different problem sizes. We see that for 10K unknowns, the total wall time continues to scale until 12 processes. For 20K, we observe scaling up to 32 processes, and up to 64 processes for 40K unknowns. Thus, we can infer that for a problem size of 80K unknowns on 20 nodes, the peak performance would be observed at 128 processes.

Figure 5.3 and Figure 5.4: Scaling on 20 nodes for different problem sizes (Westmere and Sandy Bridge, respectively)

Similarly, we can observe in Figure 5.4 that Sandy Bridge continues to improve by a factor of 2 up to 8 processes and then performance degrades after 64 processes. For 80K unknowns, we expect to see the peak performance with 128 processes on 20 nodes.

Figure 5.5 below depicts the total wall-clock time taken by Sandy Bridge and Westmere on Discover to decompose matrices of 10K, 20K and 40K unknowns on 20 nodes. We find that for 10K unknowns, the peak performance for both Sandy Bridge and Westmere is observed with 12 processes, with a relative speedup of 0.5X for the former. When the problem size is increased to 20K, the peak for the 20-node runs is observed at 32 processes, again with a speedup of 0.5X. Further, the same speedup is observed when we compare the peak performance for 40K unknowns (64 processes). Thus, we can conclude that Sandy Bridge has a uniform speedup of 0.5X over Westmere for the problem sizes defined above.

Figure 5.5: Total wall time for Sandy Bridge and Westmere on 20 nodes, for 10K, 20K and 40K unknowns
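The speedups and scaling factors quoted above follow directly from the measured wall times: classic speedup compares the time at p processes with the single-process time, parallel efficiency divides that by p, and the cross-architecture comparison is the ratio of wall times at matching process counts. A short sketch of that bookkeeping is given below; the timing values are placeholders for illustration, not the measured Discover numbers.

# Placeholder wall times in seconds at each process count (illustrative values only).
procs          = [1, 2, 4, 8, 12]
t_westmere     = [900.0, 470.0, 250.0, 140.0, 110.0]
t_sandy_bridge = [600.0, 310.0, 165.0,  92.0,  73.0]

for p, tw, ts in zip(procs, t_westmere, t_sandy_bridge):
    speedup_w    = t_westmere[0] / tw      # S(p) = T(1) / T(p)
    efficiency_w = speedup_w / p           # E(p) = S(p) / p
    ratio        = tw / ts                 # how much faster Sandy Bridge is than Westmere at p
    print(f"p={p:3d}  Westmere S={speedup_w:5.2f}  E={efficiency_w:4.2f}  "
          f"Sandy Bridge {ratio:4.2f}x faster")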

6 Conclusions

In this paper, we presented a performance evaluation of the I/O managed Slab LUD code on Nehalem, Westmere, Sandy Bridge, the NVIDIA Tesla M2070 GPU, and the Phi coprocessors. The results show that the performance impact is most strongly correlated with the memory interconnect. Westmere scales slightly better than Sandy Bridge after about 8 processes for matrices larger than 10K x 10K, but is 50% slower than Sandy Bridge. The Bluegrit GPU paired with a Westmere blade is 2x faster than the same GPU on Discover with Westmere, despite Bluegrit's lack of InfiniBand. On multiple nodes, Sandy Bridge has a uniform speedup of 0.5X over Westmere for problem sizes of 10K, 20K and 40K unknowns. Performance on Sandy Bridge with 40K unknowns using 64 processes is better than with 20K unknowns using 1 or 2 processes. Phi coprocessors with Sandy Bridge offer a 3-fold performance increase over Westmere, but they do not automatically offload unless the compiler indicates it will improve performance. Our results show better performance for Sandy Bridge relative to the Phi coprocessor than reported by Masci (Masci, 2013): the speedup achieved by the Phi over the host Sandy Bridge is 1.5X compared with his 2X-2.5X speedup. We agree with Masci that Sandy Bridge processors are more cost-efficient than adding Phi coprocessors.

7 Acknowledgements

This work was supported by the Center for Hybrid Multicore Productivity Research (CHMPR) grant NSF-0934364. The authors would also like to thank the NASA Center for Climate Simulation for access to the Discover system and the Lockheed Martin Corporation for providing the Slab LUD software.

8 References

Blattner, T. (2013). A Hybrid CPU/GPU Pipeline Workflow System. ProQuest Dissertations and Theses, 80.

Dongarra, J., et al. (2013). Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi.

Masci, F. (2013). Benchmarking the Intel Xeon Phi Coprocessor.

MAGMA MIC 1.2 Beta. (2014). Retrieved from http://icl.cs.utk.edu/magma/software/