A Parallel Implementation of the 3D NUFFT on Distributed-Memory Systems

Yuanxun Bill Bao
May 31

1 Introduction

The non-uniform fast Fourier transform (NUFFT) algorithm was originally introduced by Dutt and Rokhlin [1] to generalize the FFT algorithm to nonequispaced data on the interval $[-\pi, \pi]$. In $d$ dimensions, the NUFFT algorithm can achieve a complexity of $O\left(M^d \log M + N \left(\log \frac{1}{\epsilon}\right)^d\right)$, where $\epsilon$ is the precision of the computation, $M$ is the number of Fourier modes in each dimension, and $N$ is the total number of data points. The NUFFT algorithm arises in a variety of applications, and we refer the reader to the discussions in [2, 3].

In this report, we focus on the parallel implementation of the NUFFT on large distributed-memory systems. We note that there has been recent work on parallelizing the NUFFT on massively parallel distributed-memory systems: the PNFFT library [5]. Unlike the PNFFT library, which is based on the NFFT library in C, our implementation is based on the P3DFFT library [4] in Fortran. Due to the time constraints and scope of the project, our implementation is not yet optimized for performance and is restricted to the transform from the non-uniform physical domain to the uniform frequency domain (type-1 transform).

2 The NUFFT

In 3D, the type-1 NUFFT is mainly concerned with evaluating the sum

\[ F(k_1, k_2, k_3) = \frac{1}{N} \sum_{j=0}^{N-1} f_j \, e^{-i (k_1, k_2, k_3) \cdot x_j}, \tag{2.1} \]

where $\{x_j\}_{j=0}^{N-1}$ are non-uniformly distributed sources in the domain $[-\pi, \pi]^3$, $k_i \in \{-\frac{M}{2}, \dots, \frac{M}{2} - 1\}$ for $i = 1, 2, 3$, and the strength of the source $x_j$ is $f_j = f(x_j)$. Note that a direct evaluation of the sum (2.1) requires $O(NM^3)$ operations. Typically, when $N \gg M$, direct evaluation of the sum is computationally intractable.

The 1D NUFFT algorithm can be summarized in three steps:

1. Gridding: for each source $x_j$, spread the strength $f_j$ to the $M_s$ nearest oversampled regular grid points on each side by convolving with a Gaussian function. We use a Gaussian because, in higher dimensions, it can be written as a tensor product of 1D Gaussians. The number of oversampled regular grid points is typically set to $M_r = 2M$. More specifically, the contribution of the source $x_j$ to the grid point with index $m + m'$ is

\[ f_\tau\!\left(\frac{2\pi (m + m')}{M_r}\right) = f_j \, e^{-\left(x_j - 2\pi (m + m')/M_r\right)^2 / 4\tau}, \tag{2.2} \]

where $2\pi m / M_r$ is the regular grid point nearest to the source $x_j$ and $-M_s < m' \le M_s$.

2. FFT: take the FFT of $f_\tau$ to obtain $F_\tau(k)$.

3. Deconvolution: $F(k) = \sqrt{\pi/\tau} \, e^{k^2 \tau} F_\tau(k)$.

In practice, for $\epsilon = 10^{-12}$, we set $M_s = 12$ and $\tau = 12/M^2$. The 1D NUFFT algorithm generalizes readily to higher dimensions. In $d$ dimensions, the gridding step takes $O(24^d N)$ exponential evaluations, the FFT step takes $O(M^d \log M)$ operations, and the deconvolution step takes $O(M^d)$ multiplications. When $N \gg M$ and $M$ is large, a sequential NUFFT becomes quite expensive.

3 Distributed Memory Parallelism

We choose distributed-memory parallelism for the NUFFT because steps 1 and 3 are local and because existing libraries (FFTW, P3DFFT) already compute the FFT on distributed-memory systems. In order to run the NUFFT on massively parallel distributed-memory systems such as Stampede, we employ a 2D (pencil-shaped) domain decomposition.
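Before turning to the decomposition details, the three steps of Section 2 can be made concrete with a small sequential 1D sketch. It is only an illustration under stated assumptions, not the report's Fortran code: the function names are hypothetical, a plain DFT stands in for the FFT so that the example needs no external library, $M$ is assumed even, and the $1/N$ factor of (2.1) is assumed to be applied in the deconvolution step.

```python
import cmath
import math
import random


def direct_type1(x, f, M):
    """Direct O(N*M) evaluation of F(k) = (1/N) sum_j f_j exp(-i k x_j)."""
    N = len(x)
    return [sum(fj * cmath.exp(-1j * k * xj) for xj, fj in zip(x, f)) / N
            for k in range(-(M // 2), M // 2)]


def nufft1d_type1(x, f, M, Ms=12):
    """Gridding + DFT + deconvolution with Mr = 2M and tau = 12/M^2 (M even)."""
    N = len(x)
    Mr = 2 * M                    # oversampled grid size
    tau = 12.0 / M**2             # Gaussian width parameter from the text
    h = 2.0 * math.pi / Mr        # oversampled grid spacing

    # Step 1 (gridding): spread each source onto 2*Ms nearby grid points.
    f_tau = [0j] * Mr
    for xj, fj in zip(x, f):
        m = round(xj / h)                     # nearest grid index, may be negative
        for mp in range(-Ms + 1, Ms + 1):     # -Ms < m' <= Ms
            d = xj - h * (m + mp)
            f_tau[(m + mp) % Mr] += fj * math.exp(-d * d / (4.0 * tau))

    F = []
    for k in range(-(M // 2), M // 2):
        # Step 2: plain DFT of the gridded data (assumption: replaces the FFT).
        F_tau_k = sum(v * cmath.exp(-1j * k * h * m)
                      for m, v in enumerate(f_tau)) / Mr
        # Step 3 (deconvolution), with the 1/N normalization of (2.1) folded in.
        F.append(math.sqrt(math.pi / tau) * math.exp(k * k * tau) * F_tau_k / N)
    return F


if __name__ == "__main__":
    random.seed(0)
    xs = [random.uniform(-math.pi, math.pi) for _ in range(50)]
    fs = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(50)]
    err = max(abs(a - b) for a, b in zip(nufft1d_type1(xs, fs, 16),
                                         direct_type1(xs, fs, 16)))
    print("max abs error vs direct sum:", err)
```

The periodic wrap `(m + mp) % Mr` is what, in the parallel setting, turns into ghost data that must be handed to a neighboring processor.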
To be more precise, the x-direction of the computational domain is local to each processor, while the y- and z-directions of the domain are distributed among a 2D grid of processors (Figure 1). Each processor is responsible for a pencil-shaped chunk of the computational domain (Figure 2). On each processor, we loop through all of its sources and perform the gridding step. Inter-processor communication is necessary when the nearby regular grid points of a source lie outside the computational domain of that processor (Figure 3a). To carry out this procedure efficiently, we extend the local computational domain of each processor to include a halo of ghost arrays. Gridding can therefore be done locally first, after which we send the ghost arrays to the corresponding neighboring processors (Figure 3b). Once every processor has completed its ghost array exchanges, we call the parallel FFT provided by the P3DFFT library. The deconvolution step can be done locally within each processor. This completes the description of our parallel implementation of the NUFFT algorithm. We discuss the details of inter-processor communications next.

Figure 1: An illustration of a 4 × 4 2D grid of processors. Each processor has 8 neighbors.
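The "grid locally into a halo, then hand ghost data to the neighbors" idea can be sketched in a few lines. This is a simplified serial simulation, not the report's MPI code: only the distributed y-z plane is modeled (x is local and omitted), sources sit exactly on grid points, the stencil weight is an arbitrary illustrative Gaussian, the domain is assumed periodic, and all function names are hypothetical. The sketch assumes `n` is divisible by `P` and the halo width `w` does not exceed the block size, so ghost data only ever reaches the eight adjacent processors.

```python
import math


def stencil_weight(dy, dz):
    """Illustrative separable Gaussian weight (the width is arbitrary)."""
    return math.exp(-(dy * dy + dz * dz) / 4.0)


def grid_locally(sources, n, P, w):
    """Spread each source into its owner's local array, halo included."""
    b = n // P                       # owned block size per direction
    size = b + 2 * w                 # owned block plus a halo on both sides
    local = {(py, pz): [[0.0] * size for _ in range(size)]
             for py in range(P) for pz in range(P)}
    for iy, iz, strength in sources:
        py, pz = iy // b, iz // b    # processor that owns this source
        arr = local[(py, pz)]
        for dy in range(-w + 1, w + 1):
            for dz in range(-w + 1, w + 1):
                ly = iy - py * b + w + dy
                lz = iz - pz * b + w + dz
                arr[ly][lz] += strength * stencil_weight(dy, dz)
    return local


def exchange_ghosts(local, n, P, w):
    """Add every local entry (owned or ghost) into the block of its true owner."""
    b = n // P
    owned = {key: [[0.0] * b for _ in range(b)] for key in local}
    for (py, pz), arr in local.items():
        for ly, row in enumerate(arr):
            for lz, value in enumerate(row):
                gy = (py * b + ly - w) % n     # periodic wrap-around
                gz = (pz * b + lz - w) % n
                owned[(gy // b, gz // b)][gy % b][gz % b] += value
    return owned


def grid_serially(sources, n, w):
    """Reference: the same spreading on a single global periodic grid."""
    grid = [[0.0] * n for _ in range(n)]
    for iy, iz, strength in sources:
        for dy in range(-w + 1, w + 1):
            for dz in range(-w + 1, w + 1):
                grid[(iy + dy) % n][(iz + dz) % n] += strength * stencil_weight(dy, dz)
    return grid
```

In a real distributed run, `exchange_ghosts` would be replaced by messages between neighboring ranks; here it simply shows that adding the ghost entries into their owners reproduces the serial result.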

Figure 2: The 2D domain decomposition of the computational domain (pencil-shaped).

Figure 3: (a) An illustration of a source point whose neighboring regular grid points lie outside the computational domain of a processor. (b) An illustration of ghost arrays being sent to their corresponding neighboring processors.

4 Inter-Processor Communications

We are now ready to discuss how inter-processor communications are carried out in our implementation. After each processor completes the gridding step, it must send the corresponding ghost arrays to its eight neighbors. The order in which ghost arrays are transferred among processors is designed to avoid hangs at run time. As an illustration, in Figure 4, we divide the 2D processor grid into two groups, odd-row and even-row processors, and demonstrate how North-South communications are carried out. First, the even-row processors send data to their North neighbors (MPI_Send) and then wait to receive from their North neighbors (MPI_Recv). Meanwhile, the odd-row processors receive data from their South neighbors and then send data to the South (Figure 4a). Next, the odd-row processors send data to the North and wait to receive from the North, while the even-row processors receive data from the South and send data to the South (Figure 4b). This completes all North-South data exchanges. The E-W, NE-SW, and SE-NW communications can be carried out in a similar fashion. We note that the communication scheme discussed here is not optimal, but it effectively avoids hangs at run time.

Figure 4: (a), (b) An illustration of North-South data exchange among even- and odd-row processors.
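Why the even/odd ordering matters can be checked with a small simulation. The sketch below makes several assumptions that are not stated in the report: every send is treated as synchronous (it completes only when the matching receive is posted, which the MPI standard permits for MPI_Send), a single periodic column with an even number of rows is modeled, and the function names are hypothetical. Each row gets a list of blocking operations, and the simulator repeatedly matches a pending send with a pending receive.

```python
def north_south_programs(rows, even_odd=True):
    """Per-row list of blocking operations for a periodic column of `rows` rows."""
    programs = {}
    for r in range(rows):
        north, south = (r + 1) % rows, (r - 1) % rows
        if not even_odd:
            # Naive ordering: every row sends North first.
            ops = [("send", north), ("recv", south), ("send", south), ("recv", north)]
        elif r % 2 == 0:
            # Even rows: send North, receive from North, then serve the South side.
            ops = [("send", north), ("recv", north), ("recv", south), ("send", south)]
        else:
            # Odd rows: receive from South, send South, then serve the North side.
            ops = [("recv", south), ("send", south), ("send", north), ("recv", north)]
        programs[r] = ops
    return programs


def completes(programs):
    """Return True if all rows finish under synchronous send/receive matching."""
    pc = {p: 0 for p in programs}          # index of each row's current operation
    progressed = True
    while progressed:
        progressed = False
        for p, ops in programs.items():
            if pc[p] < len(ops) and ops[pc[p]][0] == "send":
                q = ops[pc[p]][1]
                # The send completes only if q is currently waiting to receive from p.
                if pc[q] < len(programs[q]) and programs[q][pc[q]] == ("recv", p):
                    pc[p] += 1
                    pc[q] += 1
                    progressed = True
    return all(pc[p] == len(ops) for p, ops in programs.items())


if __name__ == "__main__":
    print("even/odd ordering completes:", completes(north_south_programs(4)))
    print("send-first ordering completes:",
          completes(north_south_programs(4, even_odd=False)))
```

Under these assumptions the even/odd schedule always pairs each send with a waiting receive, whereas the send-first schedule leaves every row blocked in its first send. In practice, small messages are often buffered and a send-first ordering may appear to work, which is why an ordering that does not depend on buffering is the safer design.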

5 Results

We discuss the strong and weak scaling of our parallel implementation of the 3D NUFFT algorithm. For strong scaling, we consider $M = 1024$ and $N$ = billion sources in $[0, 2\pi]^3$. The oversampled grid resolution is set to $2M$. The other parameters are $\epsilon = 10^{-12}$, $M_s = 12$ and $\tau = 12/M^2$. We run on Stampede with 512, 1024, 2048 and 4096 processors. In Figure 5a, we plot the average time per processor for the total computation, the gridding step, the MPI communication and the FFT versus the number of processors on a log-log scale.

First, we notice that the cost of the algorithm is dominated by the gridding step. As the number of processors doubles, the total computation time and the gridding time are halved, which demonstrates the strong scaling of our implementation. The MPI communication time, though a minor part of the cost, also scales strongly. The FFT does not scale strongly because, after the domain is divided into pencil-shaped arrays, the arrays are too small for P3DFFT to show strong scaling. As a remark, we believe that our implementation will continue to scale strongly if more processors can be requested (the maximum normal queue size is 4096 on Stampede). It is worth mentioning that, if the same input data were run on a single processor (leaving aside whether the data would even fit into memory), it would take almost 2 days, compared to 40 seconds on 4096 processors.

For weak scaling, we keep the workload of each processor the same. With our current implementation, we can only compare input data that differ by a factor of 8. We compare 512 vs 4096 processors with sources per processor, and 256 vs 2048 with sources per processor. Figure 5b shows that the time per processor required for each task is almost the same for 512 vs 4096 processors and for 256 vs 2048 processors, which demonstrates the weak scaling of our implementation.

6 Conclusion

In this project, we present a parallel implementation of the 3D NUFFT algorithm on distributed-memory systems. Our implementation features a 2D domain decomposition approach and is able to scale both weakly and strongly on a large distributed-memory system (e.g. Stampede). Future work includes optimizing memory access and data storage, implementing the type-2 and type-3 transforms, and comparing to a GPU implementation.

Figure 5: (a) Strong scaling of our parallel implementation of the 3D NUFFT. (b) Weak scaling of our parallel implementation of the 3D NUFFT.
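The scaling terminology used in Section 5 can be made concrete with a short calculation. The helper names below are hypothetical, and the only inputs taken from the report are the rough figures quoted there: "almost 2 days" on one processor (used here as an upper bound of 2 × 24 × 3600 seconds) and 40 seconds on 4096 processors. The result is therefore an estimate, not a measured efficiency.

```python
def speedup(t_single, t_parallel):
    """Speedup S = T(1) / T(P)."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, procs):
    """Parallel efficiency E = S / P; E close to 1 indicates near-ideal strong scaling."""
    return speedup(t_single, t_parallel) / procs


def ideal_strong_scaling_time(t_ref, p_ref, procs):
    """Time expected at `procs` processors if doubling processors halves the time."""
    return t_ref * p_ref / procs


if __name__ == "__main__":
    t_single = 2 * 24 * 3600.0   # upper bound for "almost 2 days", in seconds
    t_4096 = 40.0                # quoted time on 4096 processors, in seconds
    print("estimated speedup:", speedup(t_single, t_4096))
    print("estimated efficiency:", efficiency(t_single, t_4096, 4096))
    # Hypothetical: what ideal scaling would imply at 512 processors.
    print("ideal time at 512 processors:", ideal_strong_scaling_time(t_4096, 4096, 512))
```

Because the single-processor time is only quoted as "almost 2 days", the estimated efficiency should be read as "roughly 1", which is consistent with the halving of the total time per doubling of processors described in Section 5.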

References

[1] A. Dutt and V. Rokhlin. Fast Fourier transforms for nonequispaced data. SIAM J. Sci. Comput., 14(6).
[2] Leslie Greengard and June-Yub Lee. Accelerating the nonuniform fast Fourier transform. SIAM Rev., 46(3).
[3] June-Yub Lee and Leslie Greengard. The type 3 nonuniform FFT and its applications. J. Comput. Phys., 206(1):1-5.
[4] Dmitry Pekurovsky. P3DFFT: a framework for parallel computations of Fourier transforms in three dimensions. SIAM J. Sci. Comput., 34(4):C192-C209.
[5] Michael Pippig and Daniel Potts. Parallel three-dimensional nonequispaced fast Fourier transforms and their application to particle simulation. SIAM J. Sci. Comput., 35(4):C411-C437.


More information

Mean square optimal NUFFT approximation for efficient non-cartesian MRI reconstruction

Mean square optimal NUFFT approximation for efficient non-cartesian MRI reconstruction Mean square optimal NUFFT approimation for efficient non-cartesian MRI reconstruction Zhili Yang and Mathews Jacob Abstract The fast evaluation of the discrete Fourier transform of an image at non-uniform

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Exploiting Depth Camera for 3D Spatial Relationship Interpretation

Exploiting Depth Camera for 3D Spatial Relationship Interpretation Exploiting Depth Camera for 3D Spatial Relationship Interpretation Jun Ye Kien A. Hua Data Systems Group, University of Central Florida Mar 1, 2013 Jun Ye and Kien A. Hua (UCF) 3D directional spatial relationships

More information

High-Resolution 3-D Radar Imaging through Nonuniform Fast Fourier Transform (NUFFT)

High-Resolution 3-D Radar Imaging through Nonuniform Fast Fourier Transform (NUFFT) COMMUNICATIONS IN COMPUTATIONAL PHYSICS Vol. 1, No. 1, pp. 176-191 Commun. Comput. Phys. February 2006 High-Resolution 3-D Radar Imaging through Nonuniform Fast Fourier Transform (NUFFT) Jiayu Song 1,

More information

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT Krishna Kandalla (1), Hari Subramoni (1), Karen Tomko (2), Dmitry Pekurovsky

More information

Feature Descriptors. CS 510 Lecture #21 April 29 th, 2013

Feature Descriptors. CS 510 Lecture #21 April 29 th, 2013 Feature Descriptors CS 510 Lecture #21 April 29 th, 2013 Programming Assignment #4 Due two weeks from today Any questions? How is it going? Where are we? We have two umbrella schemes for object recognition

More information

Examination in Image Processing

Examination in Image Processing Umeå University, TFE Ulrik Söderström 203-03-27 Examination in Image Processing Time for examination: 4.00 20.00 Please try to extend the answers as much as possible. Do not answer in a single sentence.

More information

An Efficient Boundary Integral Scheme for the Threshold Dynamics Method II: Applications to Wetting Dynamics

An Efficient Boundary Integral Scheme for the Threshold Dynamics Method II: Applications to Wetting Dynamics Noname manuscript No. (will be inserted by the editor) An Efficient Boundary Integral Scheme for the Threshold Dynamics Method II: Applications to Wetting Dynamics Dong Wang Shidong Jiang Xiao-Ping Wang

More information

HPC Fall 2010 Final Project 3 2D Steady-State Heat Distribution with MPI

HPC Fall 2010 Final Project 3 2D Steady-State Heat Distribution with MPI HPC Fall 2010 Final Project 3 2D Steady-State Heat Distribution with MPI Robert van Engelen Due date: December 10, 2010 1 Introduction 1.1 HPC Account Setup and Login Procedure Same as in Project 1. 1.2

More information

Steen Moeller Center for Magnetic Resonance research University of Minnesota

Steen Moeller Center for Magnetic Resonance research University of Minnesota Steen Moeller Center for Magnetic Resonance research University of Minnesota moeller@cmrr.umn.edu Lot of material is from a talk by Douglas C. Noll Department of Biomedical Engineering Functional MRI Laboratory

More information

Fast Spherical Filtering in the Broadband FMBEM using a nonequally

Fast Spherical Filtering in the Broadband FMBEM using a nonequally Fast Spherical Filtering in the Broadband FMBEM using a nonequally spaced FFT Daniel R. Wilkes (1) and Alec. J. Duncan (1) (1) Centre for Marine Science and Technology, Department of Imaging and Applied

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Supplementary Material for The Generalized PatchMatch Correspondence Algorithm

Supplementary Material for The Generalized PatchMatch Correspondence Algorithm Supplementary Material for The Generalized PatchMatch Correspondence Algorithm Connelly Barnes 1, Eli Shechtman 2, Dan B Goldman 2, Adam Finkelstein 1 1 Princeton University, 2 Adobe Systems 1 Overview

More information

Computational Aspects of MRI

Computational Aspects of MRI David Atkinson Philip Batchelor David Larkman Programme 09:30 11:00 Fourier, sampling, gridding, interpolation. Matrices and Linear Algebra 11:30 13:00 MRI Lunch (not provided) 14:00 15:30 SVD, eigenvalues.

More information

Introduction to Parallel Computing!

Introduction to Parallel Computing! Introduction to Parallel Computing! SDSC Summer Institute! August 6-10, 2012 San Diego, CA! Rick Wagner! HPC Systems Manager! Purpose, Goals, Outline, etc.! Introduce broad concepts " Define terms " Explore

More information

Scalable Dynamic Load Balancing of Detailed Cloud Physics with FD4

Scalable Dynamic Load Balancing of Detailed Cloud Physics with FD4 Center for Information Services and High Performance Computing (ZIH) Scalable Dynamic Load Balancing of Detailed Cloud Physics with FD4 Minisymposium on Advances in Numerics and Physical Modeling for Geophysical

More information

MRF-based Algorithms for Segmentation of SAR Images

MRF-based Algorithms for Segmentation of SAR Images This paper originally appeared in the Proceedings of the 998 International Conference on Image Processing, v. 3, pp. 770-774, IEEE, Chicago, (998) MRF-based Algorithms for Segmentation of SAR Images Robert

More information

Radix-4 FFT Algorithms *

Radix-4 FFT Algorithms * OpenStax-CNX module: m107 1 Radix-4 FFT Algorithms * Douglas L Jones This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 10 The radix-4 decimation-in-time

More information

Homework #4 Due Friday 10/27/06 at 5pm

Homework #4 Due Friday 10/27/06 at 5pm CSE 160, Fall 2006 University of California, San Diego Homework #4 Due Friday 10/27/06 at 5pm 1. Interconnect. A k-ary d-cube is an interconnection network with k d nodes, and is a generalization of the

More information

Parallelization of DQMC Simulations for Strongly Correlated Electron Systems

Parallelization of DQMC Simulations for Strongly Correlated Electron Systems Parallelization of DQMC Simulations for Strongly Correlated Electron Systems Che-Rung Lee Dept. of Computer Science National Tsing-Hua University Taiwan joint work with I-Hsin Chung (IBM Research), Zhaojun

More information

Nonuniform Fast Fourier Transforms Using Min-Max Interpolation

Nonuniform Fast Fourier Transforms Using Min-Max Interpolation 560 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 2, FEBRUARY 2003 Nonuniform Fast Fourier Transforms Using Min-Max Interpolation Jeffrey A. Fessler, Senior Member, IEEE, and Bradley P. Sutton,

More information

Parallel Systems Prof. James L. Frankel Harvard University. Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved.

Parallel Systems Prof. James L. Frankel Harvard University. Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved. Parallel Systems Prof. James L. Frankel Harvard University Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved. Architectures SISD (Single Instruction, Single Data)

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Aliasing and Antialiasing. ITCS 4120/ Aliasing and Antialiasing

Aliasing and Antialiasing. ITCS 4120/ Aliasing and Antialiasing Aliasing and Antialiasing ITCS 4120/5120 1 Aliasing and Antialiasing What is Aliasing? Errors and Artifacts arising during rendering, due to the conversion from a continuously defined illumination field

More information

Picture quality requirements and NUT proposals for JPEG AIC

Picture quality requirements and NUT proposals for JPEG AIC Mar. 2006, Cupertino Picture quality requirements and NUT proposals for JPEG AIC Jae-Jeong Hwang, Young Huh, Dai-Gyoung Kim Kunsan National Univ., KERI, Hanyang Univ. hwang@kunsan.ac.kr Contents 1. Picture

More information

Empirical Analysis of Space Filling Curves for Scientific Computing Applications

Empirical Analysis of Space Filling Curves for Scientific Computing Applications Empirical Analysis of Space Filling Curves for Scientific Computing Applications Daryl DeFord 1 Ananth Kalyanaraman 2 1 Dartmouth College Department of Mathematics 2 Washington State University School

More information

Intermediate Parallel Programming & Cluster Computing

Intermediate Parallel Programming & Cluster Computing High Performance Computing Modernization Program (HPCMP) Summer 2011 Puerto Rico Workshop on Intermediate Parallel Programming & Cluster Computing in conjunction with the National Computational Science

More information

ELGIN ACADEMY Mathematics Department Evaluation Booklet (Main) Name Reg

ELGIN ACADEMY Mathematics Department Evaluation Booklet (Main) Name Reg ELGIN ACADEMY Mathematics Department Evaluation Booklet (Main) Name Reg CfEM You should be able to use this evaluation booklet to help chart your progress in the Maths department from August in S1 until

More information

Spatial Interpolation & Geostatistics

Spatial Interpolation & Geostatistics (Z i Z j ) 2 / 2 Spatial Interpolation & Geostatistics Lag Lag Mean Distance between pairs of points 11/3/2016 GEO327G/386G, UT Austin 1 Tobler s Law All places are related, but nearby places are related

More information

High Performance Computing. Introduction to Parallel Computing

High Performance Computing. Introduction to Parallel Computing High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials

More information

Spatial Interpolation & Geostatistics

Spatial Interpolation & Geostatistics (Z i Z j ) 2 / 2 Spatial Interpolation & Geostatistics Lag Lag Mean Distance between pairs of points 1 Tobler s Law All places are related, but nearby places are related more than distant places Corollary:

More information