High-Performance Computational Electromagnetic Modeling Using Low-Cost Parallel Computers


High-Performance Computational Electromagnetic Modeling Using Low-Cost Parallel Computers
July 14, 1997
Daniel S. Katz (Daniel.S.Katz@jpl.nasa.gov)
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109

Beowulf System (Hyglac)
- 16 Pentium Pro PCs, each with a 2.5 Gbyte disk, 128 Mbyte memory, and a Fast Ethernet card.
- Connected using a 100Base-T network, through a 16-way crossbar switch.
- Theoretical peak performance: 3.2 GFlop/s.
- Achieved sustained performance: 1.26 GFlop/s.

Hyglac Cost
- Hardware cost: $54,200 (9/18/96)
  - 16 × (CPU, disk, memory)
  - 1 × (16-way crossbar, monitor, keyboard, mouse)
- Software cost: $0 (+ maintenance)
  - Public-domain OS, compilers, tools, libraries.
- 256 PE T3D: $8 million (1/94), including all software and maintenance for 3 years.
- 16 PE T3E: $1 million; 3 to 4 times more powerful than the T3D.

Hyglac Performance (vs. T3D)

                                    Hyglac (MPI)   T3D (MPI)   T3D (shmem)
  CPU Speed (MHz)                   200            150         150
  Peak Rate (MFlop/s)               200            150         150
  L1, L2 Cache Size (Kbytes)        8i+8d, 256     8i+8d, 0    8i+8d, 0
  Memory Bandwidth (Gbit/s)         0.78           1.4         1.4
  Communication Latency (µs)        150            35          1.8
  Communication Bandwidth (Mbit/s)  66             225         280-970

- Next, examine the performance of the EMCC FDTD and PHOEBUS (FE) codes.

FDTD Interior Communication
[Figure: standard domain decomposition, showing the interior cells and the required ghost cells.]
One plane of ghost cells must be communicated to each neighboring processor each time step.
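
The exchange pattern described above (one ghost plane sent to each neighbor per time step) can be sketched as follows. This is a minimal Python illustration using mpi4py and NumPy, assuming a one-dimensional slab decomposition along the last grid axis; it is not the EMCC FDTD code, and the array shapes and names are placeholders.

```python
# Minimal sketch of the per-time-step ghost-plane exchange for one field array,
# assuming a 1-D slab decomposition along the last grid axis. Shapes and names
# are illustrative; this is not the EMCC FDTD code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

nx, ny, nz_local = 282, 362, 8                 # local slab: nz_local interior planes
field = np.zeros((nx, ny, nz_local + 2))       # plus one ghost plane on each side

def exchange_ghost_planes(f):
    """Send the outermost interior planes to the neighbors; fill the ghost planes."""
    recv = np.empty((nx, ny))
    # send last interior plane right, receive the left neighbor's plane into the left ghost
    comm.Sendrecv(np.ascontiguousarray(f[:, :, -2]), dest=right,
                  recvbuf=recv, source=left)
    if left != MPI.PROC_NULL:
        f[:, :, 0] = recv
    # send first interior plane left, receive the right neighbor's plane into the right ghost
    comm.Sendrecv(np.ascontiguousarray(f[:, :, 1]), dest=left,
                  recvbuf=recv, source=right)
    if right != MPI.PROC_NULL:
        f[:, :, -1] = recv

exchange_ghost_planes(field)    # called for each field component, every time step
```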

FDTD Boundary Communication
[Figure: standard domain decomposition vs. boundary decomposition.]
Extra work is required along faces (boundary conditions, wave source, far-field data locus). A different decomposition is required for good load balance. Data at 4 faces must be redistributed twice each time step!!

FDTD Timing

                            T3D (shmem)   T3D (MPI)    Hyglac (MPI,         Hyglac (MPI,
                                                       Good Load Balance)   Poor Load Balance)
  Interior Computation      1.8 (1.3*)    1.8 (1.3*)   1.1                  1.1
  Interior Communication    0.7           0.8          3.8                  3.8
  Boundary Computation      0.19          0.19         0.14                 0.42
  Boundary Communication    0.4           1.5          5.1                  0.0
  Total                     2.0 (1.5*)    3.5 (3.0*)   55.1                 5.5

(* using assembler kernel)
All timing data is in CPU seconds per simulated time step, for a global grid size of 282 × 362 × 12, distributed over 16 processors.

FDTD Timing, cont.
- Computation: the Hyglac CPU is 35-65% faster than the T3D CPU.
- Communication:
  - T3D: MPI is 4 to 9 times slower than shmem.
  - Hyglac MPI is 3-5 times slower than T3D MPI.
- Good (or even acceptable) performance may require rewriting/modifying code.

PHOEBUS Coupled Formulation
[Figure: volume V with outward normal n̂; H inside V, equivalent currents M and J on the surface.]
For the three unknowns H (in V) and M, J (on ∂V), the total fields are split into scattered and incident parts: E = E^s + E^i, H = H^s + H^i.
The following three equations must be solved:

  ∫_V [ (1/ε_r)(∇×T)·(∇×H) − k₀² T·μ_r·H ] dv + jωε₀ ∮_∂V T·M ds = 0    (finite element equation)

  n̂ × H = J on ∂V, enforced weakly:  ∮_∂V (n̂ × U)·(n̂ × H − J) ds = 0    (essential boundary condition)

  Z_e[M] + Z_h[J] = V^i    (combined field integral equation)

PHOEBUS Coupled Equations
- This matrix problem is filled and solved by PHOEBUS:

    [ K   C   0 ] [ H ]   [ 0       ]
    [ C   0   Z ] [ M ] = [ V_M^inc ]
    [ 0   Z   Z ] [ J ]   [ V_J^inc ]

- The K submatrix is a sparse finite element matrix.
- The Z submatrices are integral equation matrices.
- The C submatrices are coupling matrices between the FE and IE matrices.

PHOEBUS Two Step Method
- Find −CK⁻¹C using QMR on each row of C, building rows of K⁻¹C, and multiplying with C (see the sketch below).
- Solve the reduced system as a dense matrix.
- If required, save K⁻¹C to solve for H.

    [ K   C   0 ] [ H ]   [ 0       ]        [ −CK⁻¹C   Z ] [ M ]   [ V_M^inc ]
    [ C   0   Z ] [ M ] = [ V_M^inc ]   →    [  Z       Z ] [ J ] = [ V_J^inc ]
    [ 0   Z   Z ] [ J ]   [ V_J^inc ]

    H = −K⁻¹ C M
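
The two-step reduction can be sketched as follows, assuming the block structure reconstructed above. The sketch uses NumPy/SciPy, with SciPy's QMR routine standing in for the pseudo-block QMR solver and randomly generated matrices in place of the PHOEBUS FE and IE blocks; all names and sizes are illustrative.

```python
# Sketch of the two-step solve: build -C K^-1 C column by column with QMR,
# solve the dense reduced system for (M, J), then recover H = -K^-1 C M.
# Random stand-in matrices; block names follow the reconstruction above.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import qmr

rng = np.random.default_rng(0)
n_fe, n_m, n_j = 500, 40, 40                  # FE unknowns (H); surface unknowns (M, J)

S = sp.random(n_fe, n_fe, density=0.01, random_state=0)
K = sp.eye(n_fe, format="csr") * 10.0 + 0.1 * (S + S.T)         # sparse FE stand-in
C = sp.random(n_fe, n_m, density=0.05, random_state=1).tocsr()  # FE/IE coupling stand-in
Z_mj = rng.standard_normal((n_m, n_j))                          # dense IE stand-ins
Z_jm = rng.standard_normal((n_j, n_m))
Z_jj = rng.standard_normal((n_j, n_j)) + n_j * np.eye(n_j)
V_m, V_j = rng.standard_normal(n_m), rng.standard_normal(n_j)

# Step 1: one QMR solve per column of C gives K^-1 C; multiply by C for -C K^-1 C.
KinvC = np.empty((n_fe, n_m))
for k in range(n_m):
    x, info = qmr(K, C[:, [k]].toarray().ravel())
    assert info == 0                          # QMR converged
    KinvC[:, k] = x
reduced_11 = -(C.T @ KinvC)                   # dense n_m x n_m block

# Step 2: solve the reduced system for the surface unknowns M and J (dense solve).
A = np.block([[reduced_11, Z_mj], [Z_jm, Z_jj]])
MJ = np.linalg.solve(A, np.concatenate([V_m, V_j]))
M = MJ[:n_m]

# If required, reuse K^-1 C to recover the interior unknowns H.
H = -KinvC @ M
```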

PHOEBUS Decomposition
- Matrix decomposition:
  - Assemble the complete matrix.
  - Reorder to equalize row bandwidth (see the sketch after this list):
    » Gibbs-Poole-Stockmeyer
    » SPARSPAK's GENRCM
  - Partition the matrix in slabs or blocks.
  - Each processor receives a slab of matrix elements.
  - Solve the matrix equation.
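
A small illustration of the bandwidth-reducing reordering step, using SciPy's reverse Cuthill-McKee routine as a stand-in for SPARSPAK's GENRCM (Gibbs-Poole-Stockmeyer is not available in SciPy); the matrix here is a random symmetric pattern rather than a PHOEBUS finite element matrix.

```python
# Reorder a sparse symmetric matrix to reduce its row bandwidth, and compare
# the bandwidth before and after. Reverse Cuthill-McKee stands in for GENRCM;
# the matrix is a random symmetric pattern, purely for illustration.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

n = 2000
A = sp.random(n, n, density=0.002, random_state=0, format="csr")
A = (A + A.T).tocsr()
A.data[:] = 1.0                                   # keep only the sparsity pattern

def bandwidth(M):
    """Maximum |row - column| over the stored nonzeros."""
    coo = M.tocoo()
    return int(np.abs(coo.row - coo.col).max())

perm = reverse_cuthill_mckee(A, symmetric_mode=True)   # permutation vector
A_rcm = A[perm, :][:, perm]                            # symmetric reordering

print("bandwidth before:", bandwidth(A))
print("bandwidth after :", bandwidth(A_rcm))
# Once banded, the matrix can be cut into row slabs so that each processor's
# rows couple only to the slabs of its left and right neighbors.
```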

PHOEBUS Matrix Reordering
[Figure: sparsity pattern of the original system, and of the system after reordering for minimum bandwidth using SPARSPAK's GENRCM reordering routine.]

PHOEBUS Matrix-Vector Multiply
[Figure: rows vs. columns of the banded, slab-distributed matrix multiplying the vector. The local processor's rows use locally owned vector entries, plus column ranges that require communication from the processors to the left and to the right.]
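
The slab-distributed matrix-vector product can be sketched as follows, assuming the matrix has been reordered so that each processor's rows only reference vector entries owned locally or by its left and right neighbors. The sketch uses mpi4py/NumPy with a dense banded block standing in for the sparse FE slab; names and sizes are illustrative.

```python
# Sketch of a slab-distributed matrix-vector multiply for a banded matrix:
# each processor owns a contiguous block of rows and needs vector entries
# from its own range plus a halo from the left and right neighbors.
# mpi4py/NumPy stand-ins; not the PHOEBUS implementation.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

n_local = 100        # rows owned by this processor (illustrative)
halo = 10            # half-bandwidth: columns reaching into each neighbor's range

# Local slab of the (reordered, banded) matrix: n_local rows whose columns
# cover the local range plus `halo` columns on each side.
A_local = np.random.default_rng(rank).standard_normal((n_local, n_local + 2 * halo))
x_local = np.ones(n_local)

def banded_matvec(A_local, x_local):
    """y_local = A_local @ [left halo of x | x_local | right halo of x]."""
    x_ext = np.zeros(n_local + 2 * halo)
    x_ext[halo:halo + n_local] = x_local
    recv_l, recv_r = np.zeros(halo), np.zeros(halo)
    # exchange the overlapping vector pieces with the two neighbors
    comm.Sendrecv(np.ascontiguousarray(x_local[:halo]), dest=left,
                  recvbuf=recv_r, source=right)
    comm.Sendrecv(np.ascontiguousarray(x_local[-halo:]), dest=right,
                  recvbuf=recv_l, source=left)
    if left != MPI.PROC_NULL:
        x_ext[:halo] = recv_l
    if right != MPI.PROC_NULL:
        x_ext[-halo:] = recv_r
    return A_local @ x_ext

y_local = banded_matvec(A_local, x_local)
```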

PHOEBUS Solver Timing
Model: dielectric cylinder with 43,791 edges, radius = 1 cm, height = 1 cm, permittivity = 4.0, at 5.0 GHz.

                                         T3D (shmem)   T3D (MPI)   Hyglac (MPI)
  Matrix-Vector Multiply Computation     129           129         59
  Matrix-Vector Multiply Communication   114           272         326
  Other Work                             47            415         136
  Total                                  18            198         522

Time of convergence (CPU seconds), solving with the pseudo-block QMR algorithm for 116 right-hand sides.

PHOEBUS Solver Timing, cont.
- Computation:
  - The Hyglac CPU is 55% faster than the T3D CPU.
  - For sparse arithmetic, the large secondary cache provides a substantial performance gain.
- Communication:
  - T3D MPI is 2.5 times slower than T3D shmem.
  - Hyglac MPI is 10 times slower than T3D MPI.
  - The code was rewritten to use manual packing and unpacking, to combine 10 small messages into 1 medium-large message (see the sketch below).
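
The message-aggregation rewrite mentioned above can be illustrated as follows: several short vectors bound for the same neighbor are packed into one contiguous buffer, exchanged as a single message, and unpacked on arrival. This is a generic mpi4py sketch, not the PHOEBUS code; buffer names and sizes are illustrative.

```python
# Combine many small messages to the same destination into one medium-sized
# message: pack into a single buffer, exchange once, unpack on arrival.
# Generic mpi4py sketch; sizes are illustrative.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
partner = rank ^ 1 if (rank ^ 1) < size else MPI.PROC_NULL   # pair adjacent ranks

# ten short vectors that would otherwise be ten latency-bound messages
pieces = [np.full(8, float(i)) for i in range(10)]
lengths = [len(p) for p in pieces]

send_buf = np.concatenate(pieces)          # pack
recv_buf = np.empty_like(send_buf)
comm.Sendrecv(send_buf, dest=partner, recvbuf=recv_buf, source=partner)

offsets = np.cumsum([0] + lengths[:-1])    # unpack (recv_buf is untouched if no partner)
received = [recv_buf[o:o + n] for o, n in zip(offsets, lengths)]
```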

PHOEBUS Solver Timing, cont.
- Other work:
  - Mostly BLAS1-type operations (dot, norm, scale, vector sum, copy)
    » Heavily dependent on memory bandwidth.
  - Also, some global sums (see the sketch below)
    » Over all PEs.
    » Length: number of RHS/block complex words.
  - 3.5 times slower than the T3D.
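
A minimal mpi4py illustration of that kind of global sum: each processor forms its local contribution to one dot product per right-hand side, and a single Allreduce over all PEs combines a complex vector whose length equals the number of right-hand sides (116 in the run above); array names and sizes are illustrative.

```python
# Global sum over all PEs of one complex partial dot product per right-hand
# side, as a block QMR iteration needs. Names and sizes are illustrative.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_rhs, n_local = 116, 1000                     # right-hand sides; locally owned rows

rng = np.random.default_rng(comm.Get_rank())
P = rng.standard_normal((n_local, n_rhs)) + 1j * rng.standard_normal((n_local, n_rhs))
R = rng.standard_normal((n_local, n_rhs)) + 1j * rng.standard_normal((n_local, n_rhs))

local_dots = np.einsum("ij,ij->j", np.conj(P), R)   # local piece: one value per RHS

global_dots = np.empty_like(local_dots)             # length = number of RHS complex words
comm.Allreduce(local_dots, global_dots, op=MPI.SUM)
```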

Conclusions
- If message sizes are not too small, adequate communication performance is possible.
- Computation rates are good, provided there is adequate data reuse.
- Low-cost parallel computers can provide better performance/$ than traditional parallel supercomputers, for limited problem sizes, and with some code rewriting.

Conclusions, cont.
- Both the EMCC FDTD code and the PHOEBUS QMR solver should scale to larger numbers of processors easily and well, since most communication is to neighbors.
- User experience has been mostly good:
  - Lack of both support and documentation can be frustrating.
  - Easy to use, once you know how to use it.