Cost-Effective Parallel Computational Electromagnetic Modeling

Cost-Effective Parallel Computational Electromagnetic Modeling
Daniel S. Katz and Tom Cwik, Jet Propulsion Laboratory ({Daniel.S.Katz, cwik}@jpl.nasa.gov)

Beowulf System at JPL (Hyglac)
- 16 Pentium Pro PCs, each with 2.5 Gbyte disk, 128 Mbyte memory, Fast Ethernet card.
- Connected using 100Base-T network, through a 16-way crossbar switch.
- Theoretical peak: 3.2 GFLOP/s
- Sustained: 1.26 GFLOP/s

Beowulf System at Caltech (Naegling)
- ~120 Pentium Pro PCs, each with 3 Gbyte disk, 128 Mbyte memory, Fast Ethernet card.
- Connected using 100Base-T network, through two 80-way switches, connected by a 4 Gbit/s link.
- Theoretical peak: ~24 GFLOP/s
- Sustained: 1.9 GFLOP/s

Hyglac Cost
- Hardware cost: $54,200 (as built, 9/96); $22,000 (estimate, 4/98)
  » 16 × (CPU, disk, memory, cables)
  » 1 × (16-way switch, monitor, keyboard, mouse)
- Software cost: $600 (+ maintenance)
  » Absoft Fortran compilers (should be $900)
  » NAG F90 compiler ($600)
  » public-domain OS, compilers, tools, libraries

Naegling Cost
- Hardware cost: $190,000 (as built, 9/97); $154,000 (estimate, 4/98)
  » 120 × (CPU, disk, memory, cables)
  » 1 × (switch, front-end CPU, monitor, keyboard, mouse)
- Software cost: $0 (+ maintenance)
  » Absoft Fortran compilers (should be $900)
  » public-domain OS, compilers, tools, libraries

Performance Comparisons
(Communication results are for MPI code.)

                                      Hyglac   Naegling   T3D   T3E-600
  CPU Speed (MHz)                       200       200     150      300
  Peak Rate (MFLOP/s)                   200       200     300      600
  Memory (Mbyte)                        128       128      64      128
  Communication Latency (µs)            150       322      35       18
  Communication Throughput (Mbit/s)      66        78     225     1200

Message-Passing Methodology
- Issue (non-blocking) receive calls: CALL MPI_IRECV(...)
- Issue (synchronous) send calls: CALL MPI_SSEND(...)
- Issue (blocking) wait calls (wait for the receives to complete): CALL MPI_WAIT(...)
A minimal sketch of this pattern follows.
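The code below is a minimal, hypothetical sketch of that receive-first pattern, not code from the original presentation: each rank posts a non-blocking receive, performs a synchronous send of one value to its right-hand neighbor in a ring, and then waits on the receive.

  program ring_exchange
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, nprocs, left, right, req
    integer :: status(MPI_STATUS_SIZE)
    double precision :: sendbuf, recvbuf

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    left  = mod(rank - 1 + nprocs, nprocs)
    right = mod(rank + 1, nprocs)
    sendbuf = dble(rank)

    ! 1) Post the non-blocking receive first, so a buffer is ready
    !    before any matching message arrives.
    call MPI_IRECV(recvbuf, 1, MPI_DOUBLE_PRECISION, left, 0, &
                   MPI_COMM_WORLD, req, ierr)

    ! 2) Synchronous send: returns only once the receiver has begun
    !    taking the message, so no unexpected-message buffering occurs.
    call MPI_SSEND(sendbuf, 1, MPI_DOUBLE_PRECISION, right, 0, &
                   MPI_COMM_WORLD, ierr)

    ! 3) Wait for the posted receive to complete before using recvbuf.
    call MPI_WAIT(req, status, ierr)

    print *, 'rank', rank, ' received ', recvbuf, ' from rank ', left
    call MPI_FINALIZE(ierr)
  end program ring_exchange

Because every rank pre-posts its receive before any synchronous send is issued, the exchange cannot deadlock and incoming data always lands directly in user buffers.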

Finite-Difference Time-Domain Application
Images produced at the University of Colorado's Computational EM Lab by Matt Larson, using SGI's LC FDTD code.
Time steps of a Gaussian pulse travelling on a microstrip, showing coupling to a neighboring strip and crosstalk to a crossing strip. Colors showing currents are relative to the peak current on that strip.
Pulse: rise time = 7 ps, frequency content to 3 GHz. Grid dimensions = 282 × 362 × 12 cells. Cell size = 1 mm³.

FDTD Algorithm
- Classic time-marching PDE solver
- Parallelized using a 2-dimensional domain decomposition with ghost cells.
(Figure: standard domain decomposition, showing each subdomain's interior cells and the required ghost cells.)

FDTD Algorithm Details
- Uses Yee's staggered grid
- Time-stepping loop (a simplified skeleton is sketched below):
  » Update electric fields (three 5-point stencils, on the x-y, x-z, and y-z planes)
  » Update magnetic fields (three 5-point stencils, on the x-y, x-z, and y-z planes)
  » Communicate magnetic fields to the ghost cells of neighboring processors (in x and y)
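The skeleton below is a hedged illustration, not the LC FDTD code: it simplifies the 2-D x-y decomposition to a 1-D split along z, collapses the six Yee components into one scalar E array and one scalar H array, and uses a made-up update coefficient. Only the overall structure (update E, update H, exchange H ghost planes with the receive/send/wait pattern) mirrors the algorithm above; the per-processor grid size is taken from the scaled problem quoted in the results that follow.

  program fdtd_skeleton
    implicit none
    include 'mpif.h'
    integer, parameter :: nx = 69, ny = 69, nzl = 76   ! local cells per processor
    integer, parameter :: nsteps = 100
    double precision   :: e(nx, ny, nzl)               ! "electric" field, no ghosts needed
    double precision   :: h(nx, ny, 0:nzl)             ! "magnetic" field; h(:,:,0) is the ghost plane
    integer :: ierr, rank, nprocs, below, above, req
    integer :: status(MPI_STATUS_SIZE), step

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    below = rank - 1
    above = rank + 1
    if (rank == 0)          below = MPI_PROC_NULL
    if (rank == nprocs - 1) above = MPI_PROC_NULL

    e = 0.0d0
    h = 0.0d0
    if (rank == 0) e(nx/2, ny/2, nzl/2) = 1.0d0        ! crude initial excitation

    do step = 1, nsteps
       ! Update E from H (placeholder stencil standing in for the Yee
       ! curl updates); the k = 1 plane reads the H ghost plane.
       e(:, :, 1:nzl) = e(:, :, 1:nzl) + 0.5d0 * (h(:, :, 1:nzl) - h(:, :, 0:nzl-1))

       ! Update H from E (placeholder; purely local here).
       h(:, 1:ny-1, 1:nzl) = h(:, 1:ny-1, 1:nzl) + 0.5d0 * (e(:, 2:ny, 1:nzl) - e(:, 1:ny-1, 1:nzl))

       ! Communicate H into the neighbor's ghost cells: receive the plane
       ! from below into h(:,:,0), send our top plane upward (receive first).
       call MPI_IRECV(h(1, 1, 0), nx*ny, MPI_DOUBLE_PRECISION, below, 0, &
                      MPI_COMM_WORLD, req, ierr)
       call MPI_SSEND(h(1, 1, nzl), nx*ny, MPI_DOUBLE_PRECISION, above, 0, &
                      MPI_COMM_WORLD, ierr)
       call MPI_WAIT(req, status, ierr)
    end do

    if (rank == 0) print *, 'completed ', nsteps, ' time steps'
    call MPI_FINALIZE(ierr)
  end program fdtd_skeleton

In the real 2-D decomposition the same exchange is performed on the x and y subdomain faces, with the two tangential magnetic field components packed into each message.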

FDTD Results

  Number of    Naegling       T3D            T3E-600
  Processors
      1        2.44 / 0.0     2.71 / 0.0     .851 / 0.0
      4        2.46 / .97     2.79 / .26     .859 / .19
     16        2.46 / .21     2.79 / .24     .859 / .51
     64        2.46 / .32     2.74 / .76     .859 / .52

  Time (wall-clock seconds per time step), scaled problem size (69 × 69 × 76 cells per processor);
  entries are computation time / communication time.

FDTD Conclusions
- Naegling and Hyglac produce similar results for 1 to 16 processors
- Scaling from 16 to 64 processors is quite reasonable
- On all numbers of processors, Beowulf-class computers perform similarly to the T3D, and worse than the T3E, as expected.

PHOEBUS
(Figures: a choke-ring antenna and a dielectric cylinder, each split into a finite element region bounded by an integral equation surface; radiation pattern from the JPL circular waveguide, from C. Zuffada et al., IEEE AP-S, 1997.)
Typical applications: radar cross section of a dielectric cylinder.

PHOEBUS Coupled Equations
- This matrix problem is filled and solved by PHOEBUS:
  » The K submatrix is a sparse finite element matrix.
  » The Z submatrices are integral equation matrices.
  » The C submatrices are coupling matrices between the FE and IE matrices.

\[
\begin{bmatrix} K & C \\ C & Z \end{bmatrix}
\begin{bmatrix} H \\ M \end{bmatrix}
=
\begin{bmatrix} 0 \\ V^{\mathrm{inc}} \end{bmatrix}
\]

PHOEBUS Solution Process

\[
\begin{bmatrix} K & C \\ C & Z \end{bmatrix}
\begin{bmatrix} H \\ M \end{bmatrix}
=
\begin{bmatrix} 0 \\ V^{\mathrm{inc}} \end{bmatrix}
\quad\Longrightarrow\quad
\bigl( Z - C K^{-1} C \bigr) M = V^{\mathrm{inc}},
\qquad
H = -K^{-1} C M
\]

- Find -C K⁻¹ C using QMR on each row of C, building up the rows of K⁻¹ C and multiplying with -C.
- Solve the reduced system as a dense matrix.
(The block elimination behind this reduction is spelled out below.)
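For reference, the reduction above is a standard block (Schur-complement) elimination; assuming the 2 × 2 block form shown on this slide, eliminating H from the first block row and substituting into the second gives

\[
K H + C M = 0 \;\Rightarrow\; H = -K^{-1} C M,
\qquad
C H + Z M = V^{\mathrm{inc}} \;\Rightarrow\; \bigl( Z - C K^{-1} C \bigr) M = V^{\mathrm{inc}} .
\]

Each column of K⁻¹C therefore costs one sparse iterative (QMR) solve with K, while the reduced system in M is small and dense enough to solve directly; these are the sparse (step 1) and dense (step 2) solves listed in the algorithm on the next slide.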

PHOEBUS Algorithm
- Assemble complete matrix
- Reorder to minimize and equalize the row bandwidth of K
- Partition matrices into slabs
- Distribute slabs among processors
- Solve sparse matrix equation (step 1)
- Solve dense matrix equation (step 2)
- Calculate observables

PHOEBUS Matrix Reordering
(Figure: non-zero structure of the matrix for the original system and for the system after reordering for minimum bandwidth, using SPARSPAK's GENRCM (reverse Cuthill-McKee) reordering routine.)

PHOEBUS Matrix-Vector Multiply
(Figure: each processor owns a contiguous slab of rows of the reordered matrix; multiplying the local rows by the vector requires vector entries communicated from the processors to the left and to the right.)
A sketch of this slab multiply appears below.
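The sketch below is a hypothetical illustration of such a slab multiply, not the PHOEBUS source: each processor owns a contiguous block of rows, and before multiplying it fetches the vector entries held by its left and right neighbors. A simple tridiagonal [-1 2 -1] stencil stands in for the reordered, banded finite element matrix, so only one ghost entry per side is needed; in the real code the exchanged segments are as wide as the matrix half-bandwidth.

  program slab_matvec
    implicit none
    include 'mpif.h'
    integer, parameter :: nloc = 1000          ! rows owned per processor
    double precision   :: x(0:nloc+1), y(nloc) ! x(0) and x(nloc+1) are ghost entries
    integer :: ierr, rank, nprocs, left, right, req(2)
    integer :: stats(MPI_STATUS_SIZE, 2), i

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    left  = rank - 1
    right = rank + 1
    if (rank == 0)          left  = MPI_PROC_NULL
    if (rank == nprocs - 1) right = MPI_PROC_NULL

    x(0) = 0.0d0
    x(nloc+1) = 0.0d0
    do i = 1, nloc
       x(i) = dble(rank*nloc + i)              ! some local vector data
    end do

    ! Receive the neighbors' boundary entries into the ghost slots,
    ! then send our own boundary entries (receive-first pattern again).
    call MPI_IRECV(x(0),      1, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, req(1), ierr)
    call MPI_IRECV(x(nloc+1), 1, MPI_DOUBLE_PRECISION, right, 1, MPI_COMM_WORLD, req(2), ierr)
    call MPI_SSEND(x(nloc),   1, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, ierr)
    call MPI_SSEND(x(1),      1, MPI_DOUBLE_PRECISION, left,  1, MPI_COMM_WORLD, ierr)
    call MPI_WAITALL(2, req, stats, ierr)

    ! Local multiply: y = A*x for the rows owned by this processor,
    ! using the ghost entries received above at the slab boundaries.
    do i = 1, nloc
       y(i) = 2.0d0*x(i) - x(i-1) - x(i+1)
    end do

    print *, 'rank', rank, ' y(1) =', y(1), ' y(nloc) =', y(nloc)
    call MPI_FINALIZE(ierr)
  end program slab_matvec

In PHOEBUS the same pattern is executed once per QMR iteration, which is why the matrix-vector communication time appears as a separate line in the timings that follow.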

PHOEBUS Solver Timing

Model: dielectric cylinder with 43,791 edges, radius = 1 cm, height = 1 cm, relative permittivity = 4.0, at 5.0 GHz.
Time to convergence (CPU seconds) on 16 processors, pseudo-block QMR algorithm with 116 right-hand sides:

                                          T3D (shmem)   T3D (MPI)   Naegling (MPI)
  Matrix-Vector Multiply Computation           129          129          152
  Matrix-Vector Multiply Communication         114          272          172
  Other Work                                    47          415         1211
  Total                                         18          198         4433

PHOEBUS Solver Timing

Model: dielectric cylinder with 1,694 edges, radius = 1 cm, height = 1 cm, relative permittivity = 4.0, at 5.0 GHz.
Time to convergence (CPU seconds) on 64 processors, pseudo-block QMR algorithm with 116 right-hand sides:

                                          T3D (shmem)   T3D (MPI)   Naegling (MPI)
  Matrix-Vector Multiply Computation           868          919         1034
  Matrix-Vector Multiply Communication         157          254         2059
  Other Work                                   323          323          923
  Total                                       1348         1496         4016

PHOEBUS Conclusions
- Beowulf is 2.4 times slower than the T3D on 16 nodes, and 3.0 times slower on 64 nodes
- The slowdown will continue to increase for larger numbers of nodes
- The T3D is about 3 times slower than the T3E
- The cost ratio between Beowulf and the other machines determines the balance points

General Conclusions
- Beowulf is a good machine for FDTD
- Beowulf may be OK for iterative solutions of sparse matrices, such as those from finite element codes, depending on machine size
- Key factor: the amount of communication