Parallelization of FFT in AFNI

Huang, Jingshan and Xi, Hong
Department of Computer Science and Engineering
University of South Carolina, Columbia, SC
huang27@cse.sc.edu and xi@cse.sc.edu

Abstract. AFNI is a widely used software package for medical image processing. However, it is not a real-time system. We are working toward a parallelized version of AFNI that will be nearly real-time. As the first step of this large task, we parallelize the FFT part of AFNI. Our method is based on MPI. The results show that, for the FFT algorithm itself, we obtain a speedup of around 2.62 in CPU time.

1 Introduction

AFNI is a widely used software package for medical image processing. However, the system has one major drawback: it is not a real-time system, which to some extent has limited its application in the related areas. Our ultimate goal is to produce a parallelized version of AFNI that is nearly real-time. As a first step, we parallelize the FFT part of AFNI. One of the reasons we chose the FFT is that, inside AFNI, the FFT is called extensively by other functions; it is a fundamental function of the package. Therefore, a parallel version of the FFT in AFNI is a significant step toward our final goal.

2 Background

2.1 AFNI

AFNI stands for Analysis of Functional NeuroImages. It is a set of C programs (over 1,000 source code files) for processing, analyzing, and displaying functional MRI (FMRI) data, a technique for mapping human brain activity. AFNI runs on Unix+X11+Motif systems, including SGI, Solaris, Linux, and Mac OS X. AFNI is an interactive program for viewing the results of 3D functional neuroimaging. It can overlay the (usually) low-resolution results of functional brain scans onto higher-resolution structural volume data sets. By marking fiducial points, one may transform the data to the proportional grid (stereotaxic coordinates) of Talairach and Tournoux [1]. Time-dependent 3D volume data sets can also be created and viewed. In addition, auxiliary programs are provided for combining and editing 3D and 3D+time functional data sets. Although widely used in medical image processing, AFNI is not a real-time software package, which limits its application in some cases.

2.2 FFT [3]

The FFT, or Fast Fourier Transform, is an algorithm for computing the discrete Fourier transform, which maps a sequence from the discrete time domain to the discrete frequency domain. It is simply a different way of laying out the computation, one that is much faster for large values of N, where N is the number of samples in the sequence.

The FFT is an ingenious way of achieving O(N log2 N) timing rather than the DFT's clumsy O(N^2). To derive it, we go backwards, starting with the 2-point transform:

   V[k] = W_2^(0*k) v[0] + W_2^(1*k) v[1],   k = 0, 1

with the two components being:

   V[0] = W_2^0 v[0] + W_2^0 v[1] = v[0] + v[1]
   V[1] = W_2^0 v[0] + W_2^1 v[1] = v[0] + W_2 v[1]

Here V is a column vector that stores the FFT of the input array v, and W_2 is the principal 2nd root of unity, equal to e^(-2*pi*i/2). In the following text we use W_N to denote the principal Nth root of unity, W_N = e^(-2*pi*i/N). We can represent the two equations for the components of the 2-point transform graphically using the so-called butterfly.

Fig. 1. Butterfly calculation: v[0] and v[1] are combined with the weights W_2^0 and W_2^1 to give V[0] and V[1]

Furthermore, using the divide-and-conquer strategy, a 4-point transform can be reduced to two 2-point transforms: one for the even elements, one for the odd elements. The odd one is multiplied by W_4^k. Diagrammatically, this can be represented as two levels of butterflies. Notice that using the identity W_(N/2)^n = W_N^(2n), we can always express all the multipliers as powers of the same W_N (in this case we choose N = 4).

Fig. 2. Diagrammatic representation of the 4-point Fourier transform calculation

In fact, all the butterflies have a similar form, combining one input weighted by W_N^s with another weighted by W_N^(s + N/2).

Fig. 3. Generic butterfly graph

This graph can be further simplified using the identity

   W_N^(s + N/2) = W_N^s * W_N^(N/2) = -W_N^s,

which is true because

   W_N^(N/2) = e^(-2*pi*i*(N/2)/N) = e^(-pi*i) = cos(-pi) + i*sin(-pi) = -1.

Fig. 4. Simplified generic butterfly: the second input is weighted by W_N^s and the result is added to, or subtracted from (multiplication by -1), the first input

Using this result, we can simplify the 4-point diagram accordingly.

Fig. 5. 4-point FFT calculation

This diagram is the essence of the FFT algorithm. The main trick is that we do not calculate each component of the Fourier transform separately; that would involve unnecessary repetition of a substantial number of calculations. Instead, we do our calculations in stages. At each stage we start with N (in general complex) numbers and "butterfly" them to obtain a new set of N complex numbers. Those numbers, in turn, become the input for the next stage. The calculation of a 4-point FFT involves two stages: the input of the first stage is the 4 original samples, and the output of the second stage is the 4 components of the Fourier transform. Notice that each stage involves N/2 complex multiplications (or N real multiplications), N/2 sign inversions (multiplications by -1), and N complex additions. So each stage can be done in O(N) time. The number of stages is log2 N (which, since N is a power of 2, is the exponent m in N = 2^m). Altogether, the FFT requires on the order of O(N log2 N) calculations. This complexity follows from the recurrence

   T(N) = 2 T(N/2) + c N
        = 2 [2 T(N/4) + c N/2] + c N = 4 T(N/4) + 2 c N
        = ...
        = 2^k T(N/2^k) + k c N.

When 2^k = N, T(N/2^k) = T(1) = 1 and k = log2 N, so T(N) = N + (log2 N) c N = O(N log2 N).

Moreover, the calculations can be done in place, using a single buffer of N complex numbers. The trick is to initialize this buffer with appropriately scrambled samples. For N = 4, the order of samples is v[0], v[2], v[1], v[3]. In general, according to our basic identity, we first divide the samples into two groups, the even ones and the odd ones. Applying this division recursively, we split each of these groups into two further groups by selecting every other sample.
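For illustration, the divide-and-conquer scheme described above can be written as a short, textbook-style recursive radix-2 FFT. This is a generic sketch only, not the csfft_cox() code from AFNI's csfft.c; the names cfft_rec and cplx are ours, and the sketch trades the in-place, scrambled-order layout for clarity by allocating scratch buffers.

   #include <stdio.h>
   #include <stdlib.h>
   #include <math.h>

   typedef struct { double r, i; } cplx;      /* complex sample: real and imaginary parts */

   /* Recursive radix-2 decimation-in-time FFT of n samples (n a power of 2).
      Split into even/odd halves, transform each half, then apply the butterflies
      with twiddle factors W_n^k = e^(-2*pi*i*k/n). */
   static void cfft_rec(int n, cplx *x)
   {
      int k;
      cplx *ev, *od;

      if (n == 1) return;                      /* a 1-point transform is the sample itself */

      ev = malloc((n/2) * sizeof(cplx));       /* even-indexed samples */
      od = malloc((n/2) * sizeof(cplx));       /* odd-indexed samples  */
      for (k = 0; k < n/2; k++) { ev[k] = x[2*k]; od[k] = x[2*k+1]; }

      cfft_rec(n/2, ev);                       /* the two half-size problems: T(n) = 2 T(n/2) + c n */
      cfft_rec(n/2, od);

      for (k = 0; k < n/2; k++) {              /* butterfly producing X[k] and X[k + n/2] */
         double ang = -2.0 * M_PI * k / n;
         cplx w = { cos(ang), sin(ang) };      /* W_n^k */
         cplx t = { w.r*od[k].r - w.i*od[k].i, /* W_n^k * Odd[k] */
                    w.r*od[k].i + w.i*od[k].r };
         x[k].r       = ev[k].r + t.r;   x[k].i       = ev[k].i + t.i;
         x[k + n/2].r = ev[k].r - t.r;   x[k + n/2].i = ev[k].i - t.i;
      }
      free(ev); free(od);
   }

   int main(void)
   {
      /* 4-point check: the FFT of {1, 2, 3, 4} is 10, -2+2i, -2, -2-2i */
      cplx v[4] = { {1,0}, {2,0}, {3,0}, {4,0} };
      int k;
      cfft_rec(4, v);
      for (k = 0; k < 4; k++) printf("V[%d] = %6.2f %+6.2fi\n", k, v[k].r, v[k].i);
      return 0;
   }

AFNI's csfft.c reaches the same result differently: the small transforms (fft2 through fft32) are hard-coded as straight-line butterfly formulas, and larger lengths are built on top of them, as described in Section 3.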

In summary, the naive implementation of the N-point discrete Fourier transform involves calculating the scalar product of the sample buffer (treated as an N-dimensional vector) with each of N separate basis vectors. Since each scalar product involves N multiplications and N additions, the total time is proportional to N^2 (in other words, it is an O(N^2) algorithm). However, it turns out that by cleverly rearranging these operations, one can optimize the algorithm down to O(N log2 N), which for large N makes a huge difference. The idea behind the FFT is the standard strategy for speeding up an algorithm, divide and conquer, which breaks the original N-point sample into two N/2-point sequences. This works because a series of smaller problems is easier to solve than one large one. The DFT requires (N-1)^2 complex multiplications and N(N-1) complex additions, whereas the FFT breaks the problem down into a series of 2-point transforms, each requiring only 1 multiplication and 2 additions, plus a recombination of the points whose cost is minimal.

2.3 FFT in AFNI

Inside AFNI, there are many functions that use the FFT algorithm to perform the Fourier transform; the FFT is a very fundamental function in AFNI. As our first step toward parallelizing AFNI, we implement a parallelized version of the FFT here.

2.4 MPI

MPI, the Message-Passing Interface [2], is the most widely used approach to developing a parallel system. Rather than specifying a new language (and hence a new compiler), it specifies a library of functions that can be called from a C or Fortran program. The foundation of this library is a small group of functions that can be used to achieve parallelism by message passing. A message-passing function, simply a function that explicitly transmits data from one process to another, is a powerful and very general method of expressing parallelism. Message-passing programs can be used to create extremely efficient parallel programs, and message passing is currently the most widely used method of programming many types of parallel computers. However, its principal drawback is that it is very difficult to design and develop programs using message passing; indeed, it has been called the assembly language of parallel computing because it forces the programmer to deal with so much detail. The introduction of MPI makes it possible for developers of parallel software to write libraries of parallel programs that are both portable and efficient. Use of these libraries hides many of the details of parallel programming and, as a consequence, makes parallel computing much more accessible to professionals in all branches of science and engineering.
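The basic pattern we rely on later, every worker rank sending its buffer to rank 0 with matching MPI_Send/MPI_Recv calls, is illustrated by the minimal stand-alone program below. It is an illustrative example only and is not part of AFNI or of our modified fftest.c.

   #include <stdio.h>
   #include <mpi.h>

   #define M 4   /* illustrative message length */

   int main(int argc, char *argv[])
   {
      int rank, np, k;
      float buf[M];
      MPI_Status status;

      MPI_Init(&argc, &argv);                  /* start up MPI */
      MPI_Comm_size(MPI_COMM_WORLD, &np);      /* number of processes */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process */

      if (rank != 0) {
         /* every worker fills a buffer and ships it to rank 0 (message tag 0) */
         for (k = 0; k < M; k++) buf[k] = (float)(100 * rank + k);
         MPI_Send(buf, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      } else {
         /* rank 0 collects one message from each worker, in rank order */
         for (k = 1; k < np; k++) {
            MPI_Recv(buf, M, MPI_FLOAT, k, 0, MPI_COMM_WORLD, &status);
            printf("rank 0 received buf[0] = %g from rank %d\n", buf[0], k);
         }
      }

      MPI_Finalize();                          /* shut down MPI */
      return 0;
   }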

3 Method

AFNI contains thousands of source code files; our focus is on csfft.c. Inside this file, we concentrate on the function csfft_cox(), which performs the FFT and is called by many other functions.

3.1 General Flow Chart of csfft_cox

Fig. 6. Flow chart of csfft_cox. Depending on the transform length, csfft_cox() dispatches to the hard-coded routines fft2, fft4, fft8, fft16 and fft32; to fft64, fft128, fft256 and fft512; or to fft1024, fft2048, fft4096, fft8192, fft16384 and fft32768, which are built on fft_4dec. Lengths containing factors of 3 or 5 (3n, 5n) go through fft_3dec and fft_5dec. The SCLINV scaling step is applied before csfft_cox() returns.

3.2 Analysis of fftn functions

We analyze the hard-coded part of the five fftn functions, i.e. fft2(), fft4(), fft8(), fft16() and fft32(), and translate the original code into a parallelized version. Here we take fft4() as an example (only the real parts are shown):

   xcx[0].r = (xcx[0].r + xcx[2].r) + (xcx[1].r + xcx[3].r)
   xcx[2].r = (xcx[0].r + xcx[2].r) - (xcx[1].r + xcx[3].r)
   xcx[1].r = (xcx[0].r - xcx[2].r) - (xcx[1].r - xcx[3].r) * csp[2].i
   xcx[3].r = (xcx[0].r - xcx[2].r) + (xcx[1].r - xcx[3].r) * csp[2].i

For the other hard-coded functions, the reader is referred to Appendix A.

3.3 One-level parallelization

There are several options for parallelizing the csfft_cox() function. At present, we adopt the one-level parallelization method, that is, we parallelize the points where fft4096() calls fft1024() and where fft8192() calls fft2048().
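Conceptually, this one-level split follows the standard radix-4 decimation in time: the N-point input is divided into four interleaved subsequences, each subsequence gets an N/4-point FFT (these are the four fft_4dec calls that the ranks execute in parallel in Appendix B), and the four sub-spectra are then merged with twiddle factors. The sketch below is our own illustration of that merge step, assuming the subsequences are x[4n], x[4n+1], x[4n+2] and x[4n+3]; it is not the actual fft_4dec code from csfft.c, and the names radix4_combine, cmul and complex_f are ours.

   #include <stdio.h>
   #include <math.h>

   typedef struct { float r, i; } complex_f;          /* same r/i layout as the xcx arrays */

   static complex_f cmul(complex_f a, complex_f b)    /* complex multiplication */
   {
      complex_f c;
      c.r = a.r * b.r - a.i * b.i;
      c.i = a.r * b.i + a.i * b.r;
      return c;
   }

   /* Radix-4 decimation-in-time merge for an N-point forward FFT (N divisible by 4).
      aa, bb, cc, dd hold the N/4-point FFTs of x[4n], x[4n+1], x[4n+2], x[4n+3];
      X receives the combined N-point spectrum.  W = e^(-2*pi*i/N). */
   static void radix4_combine(int N, const complex_f *aa, const complex_f *bb,
                              const complex_f *cc, const complex_f *dd, complex_f *X)
   {
      int k, Q = N / 4;
      for (k = 0; k < Q; k++) {
         double ang = -2.0 * M_PI * k / N;
         complex_f w1 = { (float)cos(ang),   (float)sin(ang)   };   /* W^k  */
         complex_f w2 = { (float)cos(2*ang), (float)sin(2*ang) };   /* W^2k */
         complex_f w3 = { (float)cos(3*ang), (float)sin(3*ang) };   /* W^3k */
         complex_f b = cmul(w1, bb[k]), c = cmul(w2, cc[k]), d = cmul(w3, dd[k]);

         /* X[k + m*N/4] = a + (-i)^m b + (-1)^m c + (i)^m d,  m = 0..3 */
         X[k].r       = aa[k].r + b.r + c.r + d.r;   X[k].i       = aa[k].i + b.i + c.i + d.i;
         X[k + Q].r   = aa[k].r + b.i - c.r - d.i;   X[k + Q].i   = aa[k].i - b.r - c.i + d.r;
         X[k + 2*Q].r = aa[k].r - b.r + c.r - d.r;   X[k + 2*Q].i = aa[k].i - b.i + c.i - d.i;
         X[k + 3*Q].r = aa[k].r - b.i - c.r + d.i;   X[k + 3*Q].i = aa[k].i + b.r - c.i - d.r;
      }
   }

   int main(void)
   {
      /* Tiny end-to-end check with N = 8 and x[n] = n.  The 2-point FFT of {a, b}
         is {a+b, a-b}, so the four sub-FFTs can be written down directly. */
      complex_f aa[2] = { {4,0},  {-4,0} };   /* FFT of {x[0], x[4]} = {0, 4} */
      complex_f bb[2] = { {6,0},  {-4,0} };   /* FFT of {x[1], x[5]} = {1, 5} */
      complex_f cc[2] = { {8,0},  {-4,0} };   /* FFT of {x[2], x[6]} = {2, 6} */
      complex_f dd[2] = { {10,0}, {-4,0} };   /* FFT of {x[3], x[7]} = {3, 7} */
      complex_f X[8];
      int k;
      radix4_combine(8, aa, bb, cc, dd, X);
      for (k = 0; k < 8; k++) printf("X[%d] = %6.2f %+6.2fi\n", k, X[k].r, X[k].i);
      return 0;
   }

In the parallel version, the N/4-point sub-FFTs are what the four ranks compute concurrently; the merge is then a purely local O(N) pass on the rank that gathers the sub-spectra.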

fftest.c is a source code file written solely for the purpose of testing FFT speed. It reports the CPU time, the wall clock time (elapsed time), the number of bytes of data processed, and the MFLOP rate. First of all, we set up the usual MPI initialization calls in the main function of fftest.c:

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &np);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

Then, in the csfft_cox() function, we handle the 4096 and 8192 cases (for the details, refer to Appendix B). Finally, back in the main function of fftest.c, we terminate MPI:

   MPI_Finalize();

4 Experiment Results

The following charts show some of our experiment results. The header of each chart lists three parameters: the FFT length, the number of iterations, and the dimension of the vector (how many FFTs per iteration). For the 2- and 4-processor runs, the CPU Time column lists one value per rank (rank 0 first), and the Average CPU Time is taken over the ranks other than rank 0 (see Comprehensive Chart 3 below).

FFT 4096 * 5,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   3.2                          .4            3.3            -
   2 Processors    3.69  2.2                    2.93          16.27          2.2
   4 Processors    3.88  1.77  1.57  1.45       5.33          23.53          1.6

FFT 4096 * 10,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   7.93                         1.35          13.83          -
   2 Processors    8.54  4.99                   7.47          37.7           4.99
   4 Processors    7.77  2.88  3.23  3.12       11.15         5.21           3.8

FFT 4096 * 20,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   14.55                        1.4           2.52           -
   2 Processors    19.22  8.54                  13.89         89.31          8.54
   4 Processors    15.65  5.5  5.28  5.88       22.39         13.88          5.55

FFT 4096 * 100,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   74.31                        9.38          114.53         -
   2 Processors    75  46.66                    64.67         363.83         46.66
   4 Processors    82.48  29.4  29.93  32.71    11.58         518.1          3.56

Comprehensive Chart 1 for FFT 4096
                   System Time / Total CPU Time           System Time / Rank 0 CPU Time
   Iterations      5,000   10,000   20,000   100,000      5,000   10,000   20,000   100,000
   Original Code   .1      .17      .1       .13          .1      .17      .1       .13
   2 Processors    .51     .55      .5       .53          .79     .87      .72      .86
   4 Processors    .61     .66      .69      .63          1.37    1.44     1.43     1.34

Comprehensive Chart 2 for FFT 4096
                   Total CPU Time / Elapsed Time          Rank 0 CPU Time / Elapsed Time
   Iterations      5,000   10,000   20,000   100,000      5,000   10,000   20,000   100,000
   Original Code   .97     .57      .71      .65          .97     .57      .71      .65
   2 Processors    .35     .36      .31      .33          .23     .23      .22      .21
   4 Processors    .37     .34      .31      .34          .15     .15      .15      .16

Comprehensive Chart 3 for FFT 4096
                   Sequential CPU Time / Average CPU Time    Sequential CPU Time / Rank 0 CPU Time
                   (except for rank 0)
   Iterations      5,000   10,000   20,000   100,000      5,000   10,000   20,000   100,000
   2 Processors    1.58    1.59     1.7      1.59         .87     .93      .76      .99
   4 Processors    2.0     2.58     2.62     2.43         .82     1.2      .93      .9

FFT 8192 * 5,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   7.29                         .15           7.61           -
   2 Processors    1.43  5.7                    7.11          28.72          5.7
   4 Processors    8.74  3.76  3.98  3.46       1.63          38.69          3.73

FFT 8192 * 10,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   17.45                        .33           18.27          -
   2 Processors    19.99  13.44                 13.79         74.87          13.44
   4 Processors    19.8  7.59  7.57  7.51       21.17         86.57          7.56

FFT 8192 * 20,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   3.7                          .9            34.22          -
   2 Processors    39.37  23.13                 27.29         142.89         23.13
   4 Processors    35.77  15.2  15.15  13.75    42.9          169.48         14.7

FFT 8192 * 100,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   152.72                       3.81          166.58         -
   2 Processors    27.98  114.16                134.52        698.48         114.16
   4 Processors    197.67  71  78.35  75.97     228.45        96.3           75.11

Comprehensive Chart 1 for FFT 8192
                   System Time / Total CPU Time           System Time / Rank 0 CPU Time
   Iterations      5,000   10,000   20,000   100,000      5,000   10,000   20,000   100,000
   Original Code   .2      .2       .3       .2           .1      .17      .1       .13
   2 Processors    .44     .41      .44      .42          .68     .69      .69      .65
   4 Processors    .53     .51      .54      .54          1.22    1.11     1.2      1.16

Comprehensive Chart 2 for FFT 8192
                   Total CPU Time / Elapsed Time          Rank 0 CPU Time / Elapsed Time
   Iterations      5,000   10,000   20,000   100,000      5,000   10,000   20,000   100,000
   Original Code   .96     .96      .9       .92          .96     .96      .9       .92
   2 Processors    .56     .45      .44      .46          .36     .27      .28      .3
   4 Processors    .52     .48      .47      .47          .22     .22      .21      .22

Comprehensive Chart 3 for FFT 8192
                   Sequential CPU Time / Average CPU Time    Sequential CPU Time / Rank 0 CPU Time
                   (except for rank 0)
   Iterations      5,000   10,000   20,000   100,000      5,000   10,000   20,000   100,000
   2 Processors    1.28    1.3      1.33     1.34         .7      .87      .78      .73
   4 Processors    1.95    2.31     2.9      2.3          .83     .91      .86      .77

5 Discussion

5.1 The correctness of our code

In our driver program, fftest.c, which is used to test the speed of the FFT, we read data from a file that contains 2,048,000 complex numbers. After performing the FFT and then the IFFT, we obtain a set of complex numbers that are almost the same as the ones in the original data file; the small differences come from the fact that both the real and imaginary parts of those complex numbers are floating-point numbers. This result shows that our parallelized version of the FFT code functions correctly.

5.2 One-level parallelization versus multi-level parallelization

In the original sequential code, the calculation has many data dependencies among the instructions. We analyzed the code and worked out the formula for each element. With these formulas there is no interdependence among different elements, and that forms the basis of the parallelization. In our first attempt at a parallelized implementation, we targeted the most basic functions, i.e. the five fft functions fft2, fft4, fft8, fft16 and fft32. Our original reasoning was that those functions are the ones called by other functions hundreds (maybe thousands) of times. However, the experiments showed that the cost spent in communication among all the participating nodes, together with other overhead, kills the speedup obtained by the parallelization itself.

Therefore, we did not parallelize those basic functions. Instead, we shifted our focus to a higher level, namely the points where fft4096() calls fft1024() and where fft8192() calls fft2048(). In addition, each time we use 4 processors to distribute the calculation workload. Also, our implementation does not adopt the multi-level technique. The advantage of multi-level parallelization is obvious and straightforward: it parallelizes the sequential code to the greatest possible extent. The disadvantage is that this kind of parallelization greatly increases the difficulty of the implementation, and the communication cost and other overhead cannot be neglected. In future work, we might consider a two-level parallelization technique in the implementation.

5.3 Analysis of speedup

There are two kinds of time that we take into account when analyzing our experimental results: CPU time and wall clock time (elapsed time). The former is the time spent in the calculation part of the code; the latter is the total elapsed time from the viewpoint of the user. Ideally both of these would decrease. However, elapsed time is much more non-deterministic than CPU time and is largely outside the control of our code. Furthermore, different strategies for assigning tasks among processors give different results. We tried two methods of distributing the workload: one distributes it over two machines (nodes), the other uses four machines. Both methods assign the workload evenly among the nodes. In addition, before each run of our program, we check and make sure that the nodes on which our code runs have a CPU occupancy of less than 2 percent. From our experimental data, we find that with both distribution strategies the CPU time is decreased, to different degrees. When we use the data set 4096/20,000/1 on four machines, meaning an FFT length of 4096, 20,000 iterations and one FFT per iteration, we get the maximum speedup in CPU time, a factor of around 2.62. However, here we consider only the average CPU time of the nodes other than the head node. The CPU time on the head node occupies a certain fraction of the wall clock time and is always a little more than the sequential CPU time. We believe there are problems regarding the overhead on the head node, because that node is the machine that distributes the original data and gathers the results for further processing such as recombining. Possible reasons include a networking bottleneck, an inefficient way of sending and receiving messages among the different nodes, and perhaps too many iterations as well. As for wall clock time, both distribution methods increased it. Moreover, the total CPU time is only 31 to 37 percent of the wall clock time (for an FFT length of 4096) and 47 to 52 percent of the wall clock time (for an FFT length of 8192). Therefore, it is obvious that most of the wall clock time is spent in the system itself, and we are still working to find out the exact reason(s). With both strategies, we did not obtain the ideal speedup of 2 or 4 times. The main reasons are as follows. First of all, there is competition among different users for the same CPU. Each time, we choose the machines that are least occupied by other users to run our code, but we have no control over those CPUs the way an operating system does.
So it can happen that, while our process is running on one CPU, that specific CPU is picked up by other users and hence becomes more occupied than it appeared to be. Secondly, due to the communication cost and other overhead, it is impossible to obtain the ideal speedup on real machines. In fact, as the size of our data set increases, the speedup increases as well. However, it is not true that the speedup keeps increasing indefinitely as we feed more input data to the nodes; we believe the reason is that as the running time increases, the risk of the selected nodes being occupied by other processes also increases. Therefore, there is an optimal data-set size that yields the largest speedup. In our experiments, the parameters 4096/20,000/1 running on four nodes are the best.
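The speedup figure quoted above is the ratio used in Comprehensive Chart 3: the sequential CPU time divided by the average CPU time of the ranks other than rank 0. The small helper below makes this definition explicit; it is written only for this analysis (the name speedup_vs_workers is ours, and the interpretation of the table columns is as described in Section 4), and the sample values are those recorded for the 4096-point, 20,000-iteration run on four processors.

   #include <stdio.h>

   /* Speedup as reported in Comprehensive Chart 3: sequential CPU time divided by
      the average CPU time of the worker ranks.  Rank 0, which scatters the data
      and gathers the results, is excluded from the average. */
   static double speedup_vs_workers(double seq_cpu, const double *rank_cpu, int nranks)
   {
      double sum = 0.0;
      int r;
      for (r = 1; r < nranks; r++) sum += rank_cpu[r];   /* skip rank 0 */
      return seq_cpu / (sum / (nranks - 1));
   }

   int main(void)
   {
      /* 4096-point FFT, 20,000 iterations: sequential CPU time 14.55,
         per-rank CPU times on four processors 15.65, 5.5, 5.28 and 5.88. */
      double ranks[4] = { 15.65, 5.5, 5.28, 5.88 };
      printf("speedup = %.2f\n", speedup_vs_workers(14.55, ranks, 4));   /* prints 2.62 */
      return 0;
   }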

6 Conclusions

We have parallelized the FFT part of the AFNI software package. Our method is based on MPI. The results show that, for the FFT algorithm itself, we obtain a speedup of around 2.62 in CPU time. Future work includes parallelizing the workhorse analysis tool in AFNI, the 3dDeconvolve program, which carries out multiple linear regression on voxel time series and computes the associated statistics. Furthermore, our final goal is to make a near real-time AFNI system.

References

1. Jean Talairach and Pierre Tournoux: Co-Planar Stereotaxic Atlas of the Human Brain. Thieme Medical Publishers, New York (1988) 11-121
2. Peter S. Pacheco: Parallel Programming with MPI. Morgan Kaufmann Publishers, San Francisco, California (1997) 11-41
3. Quinn: Parallel Algorithm, Chapter 8: The Fast Fourier Transform. Morgan Kaufmann Publishers, San Francisco, California (1997) 198-213

Appendix A:

fft8:

   xcx[0].r = (xcx[0].r + xcx[4].r) + (xcx[2].r + xcx[6].r) + [(xcx[1].r + xcx[5].r) + (xcx[3].r + xcx[7].r)]
   xcx[4].r = (xcx[0].r + xcx[4].r) + (xcx[2].r + xcx[6].r) - [(xcx[1].r + xcx[5].r) + (xcx[3].r + xcx[7].r)]
   xcx[1].r = (xcx[0].r - xcx[4].r) + csp[2].i * (xcx[2].r - xcx[6].r) + csp[4].i * [(xcx[1].r - xcx[5].r) + csp[2].i * (xcx[3].r - xcx[7].r)]
   xcx[5].r = (xcx[0].r - xcx[4].r) + csp[2].i * (xcx[2].r - xcx[6].r) - csp[4].i * [(xcx[1].r - xcx[5].r) + csp[2].i * (xcx[3].r - xcx[7].r)]
   xcx[2].r = (xcx[0].r + xcx[4].r) - (xcx[2].r + xcx[6].r) + csp[5].i * [(xcx[1].r + xcx[5].r) - (xcx[3].r + xcx[7].r)]
   xcx[6].r = (xcx[0].r + xcx[4].r) - (xcx[2].r + xcx[6].r) - csp[5].i * [(xcx[1].r + xcx[5].r) - (xcx[3].r + xcx[7].r)]
   xcx[3].r = (xcx[0].r - xcx[4].r) - csp[2].i * (xcx[2].r - xcx[6].r) + csp[6].i * [(xcx[1].r - xcx[5].r) - csp[2].i * (xcx[3].r - xcx[7].r)]
   xcx[7].r = (xcx[0].r - xcx[4].r) - csp[2].i * (xcx[2].r - xcx[6].r) - csp[6].i * [(xcx[1].r - xcx[5].r) - csp[2].i * (xcx[3].r - xcx[7].r)]

fft16:

   xcx[0].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] + [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] + [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] + [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
   xcx[8].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] + [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] - [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] - [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
   xcx[1].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] + csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r + xcx[14].r)] + csp[8].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[8].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[9].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] + csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r + xcx[14].r)] - csp[8].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[8].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[2].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] + csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] + csp[9].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] + csp[9].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
   xcx[10].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] + csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] - csp[9].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] - csp[9].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
   xcx[3].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] + csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] + csp[10].i * [-csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[10].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[11].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] + csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] - csp[10].i * [-csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[10].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[4].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] - [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] + csp[11].i * [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] - csp[11].i * [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
   xcx[12].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] - [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] - csp[11].i * [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] + csp[11].i * [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]

   xcx[5].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] - csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r + xcx[14].r)] + csp[12].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[12].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[13].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] - csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r + xcx[14].r)] - csp[12].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[12].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[6].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] - csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] + csp[13].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] - csp[13].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
   xcx[14].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] - csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] - csp[13].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] + csp[13].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
   xcx[7].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] - csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] + csp[14].i * [-csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[14].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[15].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] - csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] - csp[14].i * [-csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[14].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]

Appendix B:

   case 4096:   // modified by Huang and Xi

      // calculate array aa
      if (rank == 0)
         fft_4dec(mode, 1024, aa);

      // calculate arrays bb, cc and dd, then split them into two subarrays each
      if (rank == 1) {
         fft_4dec(mode, 1024, bb);
         for (i = 0; i < M; i++) {
            bb_r[i] = bb[i].r;
            bb_i[i] = bb[i].i;
         }
      }
      if (rank == 2) {
         fft_4dec(mode, 1024, cc);
         for (i = 0; i < M; i++) {
            cc_r[i] = cc[i].r;
            cc_i[i] = cc[i].i;
         }
      }
      if (rank == 3) {
         fft_4dec(mode, 1024, dd);
         for (i = 0; i < M; i++) {
            dd_r[i] = dd[i].r;
            dd_i[i] = dd[i].i;
         }
      }

      MPI_Barrier(MPI_COMM_WORLD);

      // send and receive the arrays
      if (rank == 1) {
         MPI_Send(bb_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
         MPI_Send(bb_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      }
      if (rank == 2) {
         MPI_Send(cc_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
         MPI_Send(cc_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      }
      if (rank == 3) {
         MPI_Send(dd_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
         MPI_Send(dd_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      }
      if (rank == 0) {
         MPI_Recv(bb_r, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(bb_i, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(cc_r, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(cc_i, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(dd_r, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(dd_i, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);

         // recombine bb, cc and dd
         for (i = 0; i < M; i++) {
            bb[i].r = bb_r[i];  bb[i].i = bb_i[i];
            cc[i].r = cc_r[i];  cc[i].i = cc_i[i];
            dd[i].r = dd_r[i];  dd[i].i = dd_i[i];
         }
      }
      break;

   case 8192:   // modified by Huang and Xi

      // calculate array aa
      if (rank == 0)
         fft_4dec(mode, 2048, aa);

      // calculate arrays bb, cc and dd, then split them into two subarrays each
      if (rank == 1) {
         fft_4dec(mode, 2048, bb);
         for (i = 0; i < M; i++) {
            bb_r[i] = bb[i].r;
            bb_i[i] = bb[i].i;
         }
      }
      if (rank == 2) {
         fft_4dec(mode, 2048, cc);
         for (i = 0; i < M; i++) {
            cc_r[i] = cc[i].r;
            cc_i[i] = cc[i].i;
         }
      }
      if (rank == 3) {
         fft_4dec(mode, 2048, dd);
         for (i = 0; i < M; i++) {
            dd_r[i] = dd[i].r;
            dd_i[i] = dd[i].i;
         }
      }

      MPI_Barrier(MPI_COMM_WORLD);

      // send and receive the arrays
      if (rank == 1) {
         MPI_Send(bb_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
         MPI_Send(bb_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      }
      if (rank == 2) {
         MPI_Send(cc_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
         MPI_Send(cc_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      }
      if (rank == 3) {
         MPI_Send(dd_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
         MPI_Send(dd_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      }
      if (rank == 0) {
         MPI_Recv(bb_r, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(bb_i, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(cc_r, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(cc_i, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(dd_r, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(dd_i, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);

         // recombine bb, cc and dd
         for (i = 0; i < M; i++) {
            bb[i].r = bb_r[i];  bb[i].i = bb_i[i];
            cc[i].r = cc_r[i];  cc[i].i = cc_i[i];
            dd[i].r = dd_r[i];  dd[i].i = dd_i[i];
         }
      }
      break;