Parallelization of FFT in AFNI


Huang, Jingshan and Xi, Hong
Department of Computer Science and Engineering, University of South Carolina, Columbia, SC

Abstract. AFNI is a widely used software package for medical image processing. However, it is not a real-time system. We are working to make a parallelized version of AFNI that is nearly real-time. As the first step of this enormous task, we parallelize the FFT part of AFNI. Our method is based on MPI. The results show that, for the FFT algorithm itself, we obtain a speedup of around 2.62 in CPU time.

1 Introduction

AFNI is a widely used software package for medical image processing. However, the system has a major drawback: it is not a real-time system, which to some extent has impaired its application in related areas. Our ultimate goal is to produce a parallelized version of AFNI that is nearly real-time. As a first step, we parallelize the FFT part of AFNI. One reason for choosing the FFT is that, inside AFNI, it is called extensively by other functions; the FFT is a fundamental function in AFNI. A parallel version of the FFT in AFNI is therefore a significant step toward our final goal.

2 Background

2.1 AFNI

AFNI stands for Analysis of Functional NeuroImages. It is a set of C programs (over 1,000 source code files) for processing, analyzing, and displaying functional MRI (FMRI) data - a technique for mapping human brain activity. AFNI runs on Unix+X11+Motif systems, including SGI, Solaris, Linux, and Mac OS X. AFNI is an interactive program for viewing the results of 3D functional neuroimaging: it can overlay the (usually) low-resolution results of functional brain scans onto higher-resolution structural volume data sets. By marking fiducial points, one may transform the data to the proportional grid (stereotaxic coordinates) of Talairach and Tournoux [1]. Time-dependent 3D volume data sets can also be created and viewed. In addition, auxiliary programs are provided for combining and editing 3D and 3D+time functional data sets. Although widely used in medical image processing, AFNI is not a real-time software package, which limits its application in some cases.

2.2 FFT [3]

The FFT (Fast Fourier Transform) is an algorithm for computing the discrete Fourier transform, taking data from the discrete time domain to the discrete frequency domain.

It is simply a way of laying out the computation that is much faster for large values of N, where N is the number of samples in the sequence: the FFT is an ingenious way of achieving O(N log_2 N) timing rather than the DFT's clumsy O(N^2) timing.

To carry out the FFT, we build up from the 2-point transform:

V[k] = W_2^(0k) v[0] + W_2^(1k) v[1],   k = 0, 1

with the two components being:

V[0] = W_2^0 v[0] + W_2^0 v[1] = v[0] + v[1]
V[1] = W_2^0 v[0] + W_2^1 v[1] = v[0] + W_2 v[1]

Here V is a column vector that stores the FFT of the input array v, and W_2 is the principal 2nd root of unity, equal to e^(-πi) = -1. In the following text, W_N denotes the principal Nth root of unity, W_N = e^(-2πi/N).

We can represent the two equations for the components of the 2-point transform graphically using the so-called butterfly.

Fig. 1. Butterfly calculation

Furthermore, using the divide-and-conquer strategy, a 4-point transform can be reduced to two 2-point transforms: one for the even elements, one for the odd elements. The odd one is multiplied by W_4^k. Diagrammatically, this can be represented as two levels of butterflies. Notice that using the identity W_(N/2)^n = W_N^(2n), we can always express all the multipliers as powers of the same W_N (in this case we choose N = 4).

Fig. 2. Diagrammatical representation of the 4-point Fourier transform calculation

In fact, all the butterflies have a similar form, with one branch multiplied by W_N^s and the other by W_N^(s+N/2).

Fig. 3. Generic butterfly graph
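The following short C program sketches this recursive even/odd decomposition (our illustration only; AFNI's csfft.c is organized very differently, with hand-unrolled fixed-size transforms, and the function name fft_recursive is ours):

#include <complex.h>
#include <math.h>

/* Minimal recursive radix-2 FFT sketch: split the buffer into even and
 * odd halves, transform each half, then combine with the simplified
 * butterflies V[k] = E[k] + W_N^k O[k] and V[k+N/2] = E[k] - W_N^k O[k].
 * N must be a power of 2. */
static void fft_recursive(int N, double complex *v)
{
    if (N < 2) return;

    double complex odd[N / 2];                         /* scratch buffer */
    for (int k = 0; k < N / 2; k++) odd[k] = v[2 * k + 1];
    for (int k = 0; k < N / 2; k++) v[k] = v[2 * k];   /* evens first    */
    for (int k = 0; k < N / 2; k++) v[N / 2 + k] = odd[k];

    fft_recursive(N / 2, v);                           /* E = FFT(evens) */
    fft_recursive(N / 2, v + N / 2);                   /* O = FFT(odds)  */

    for (int k = 0; k < N / 2; k++) {
        double complex w = cexp(-2.0 * M_PI * I * k / N);  /* W_N^k */
        double complex e = v[k], o = v[N / 2 + k];
        v[k]         = e + w * o;                      /* butterfly      */
        v[N / 2 + k] = e - w * o;
    }
}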

This graph can be further simplified using the identity W_N^(s+N/2) = W_N^s · W_N^(N/2) = -W_N^s, which holds because W_N^(N/2) = e^(-2πi(N/2)/N) = e^(-πi) = cos(-π) + i·sin(-π) = -1. Here is the simplified butterfly:

Fig. 4. Simplified generic butterfly

Using this result, we can simplify the 4-point diagram:

Fig. 5. 4-point FFT calculation

This diagram is the essence of the FFT algorithm. The main trick is that we do not calculate each component of the Fourier transform separately; that would involve unnecessary repetition of a substantial number of calculations. Instead, we do our calculations in stages. At each stage we start with N (in general complex) numbers and "butterfly" them to obtain a new set of N complex numbers; those numbers, in turn, become the input for the next stage. The calculation of a 4-point FFT involves two stages: the input of the first stage is the 4 original samples, and the output of the second stage is the 4 components of the Fourier transform. Notice that each stage involves N/2 complex multiplications (or N real multiplications), N/2 sign inversions (multiplications by -1), and N complex additions, so each stage can be done in O(N) time. The number of stages is log_2 N (which, since N is a power of 2, is the exponent m in N = 2^m). Altogether, the FFT requires on the order of O(N log_2 N) calculations. This complexity comes from the following derivation:

T(N) = 2T(N/2) + cN
     = 2[2T(N/4) + cN/2] + cN = 4T(N/4) + 2cN
     = ...
     = 2^k T(N/2^k) + k·cN

When 2^k = N, we have T(N/2^k) = T(1) = 1 and k = log_2 N, so T(N) = N + cN·log_2 N = O(N log_2 N).

Moreover, the calculations can be done in place, using a single buffer of N complex numbers. The trick is to initialize this buffer with appropriately scrambled samples. For N = 4, the order of samples is v[0], v[2], v[1], v[3]. In general, following our basic identity, we first divide the samples into two groups, the even ones and the odd ones. Applying this division recursively, we split each of these groups into two groups by selecting every other sample.
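The "appropriately scrambled" order is the bit-reversal permutation: the sample at index j moves to the index whose binary representation is the m bits of j reversed. A small C sketch (ours, for illustration; AFNI's actual code differs):

/* Bit-reversal scrambling: after this permutation, the in-place,
 * stage-by-stage butterfly passes can run on a single buffer.
 * For N = 4 (m = 2) the resulting order is v[0], v[2], v[1], v[3]. */
static unsigned reverse_bits(unsigned j, unsigned m)
{
    unsigned r = 0;
    while (m-- > 0) {              /* shift the low bit of j into r */
        r = (r << 1) | (j & 1u);
        j >>= 1;
    }
    return r;
}

static void scramble(unsigned N, unsigned m, float re[], float im[])
{
    for (unsigned j = 0; j < N; j++) {
        unsigned r = reverse_bits(j, m);
        if (r > j) {               /* swap each pair exactly once */
            float t;
            t = re[j]; re[j] = re[r]; re[r] = t;
            t = im[j]; im[j] = im[r]; im[r] = t;
        }
    }
}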

In summary, the naive implementation of the N-point digital Fourier transform involves calculating the scalar product of the sample buffer (treated as an N-dimensional vector) with N separate basis vectors. Since each scalar product involves N multiplications and N additions, the total time is proportional to N^2 (in other words, it is an O(N^2) algorithm). However, it turns out that by cleverly rearranging these operations, one can optimize the algorithm down to O(N log_2 N), which for large N makes a huge difference. The idea behind the FFT is the standard strategy for speeding up an algorithm, divide and conquer, which breaks the original N-point sample into two N/2-point sequences; a series of smaller problems is easier to solve than one large one. The DFT requires (N-1)^2 complex multiplications and N(N-1) complex additions, whereas the FFT breaks the problem down into a series of 2-point transforms, each requiring only 1 multiplication and 2 additions, plus a recombination of the points whose cost is minimal.

2.3 FFT in AFNI

Inside AFNI, many functions use the FFT algorithm to perform the Fourier transform; indeed, the FFT is a very fundamental function in AFNI. As our first step in parallelizing AFNI, we implement a parallelized version of the FFT here.

2.4 MPI

MPI, the Message-Passing Interface [2], is the most widely used approach to developing a parallel system. Rather than specifying a new language (and hence a new compiler), it specifies a library of functions that can be called from a C or Fortran program. The foundation of this library is a small group of functions that can be used to achieve parallelism by message passing. A message-passing function is simply a function that explicitly transmits data from one process to another; message passing is a powerful and very general method of expressing parallelism. Message-passing programs can be extremely efficient, and message passing is currently the most widely used method of programming many types of parallel computers. Its principal drawback is that it is very difficult to design and develop programs using message passing; indeed, it has been called the assembly language of parallel computing because it forces the programmer to deal with so much detail. MPI makes it possible for developers of parallel software to write libraries of parallel programs that are both portable and efficient. Use of these libraries hides many of the details of parallel programming and, as a consequence, makes parallel computing much more accessible to professionals in all branches of science and engineering.
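To give the flavor of this style, the following complete toy program (ours, not AFNI code) sends a small array from rank 1 to rank 0; compile with mpicc and run with mpirun -np 2:

#include <stdio.h>
#include <mpi.h>

/* Toy message-passing example: rank 1 computes a small array and
 * explicitly transmits it to rank 0, which prints it. */
int main(int argc, char **argv)
{
    int rank, np;
    float buf[4] = {0.0f};

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        for (int i = 0; i < 4; i++) buf[i] = 2.0f * i;     /* "work"    */
        MPI_Send(buf, 4, MPI_FLOAT, 0, 0, MPI_COMM_WORLD); /* to rank 0 */
    } else if (rank == 0) {
        MPI_Status status;
        MPI_Recv(buf, 4, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
        printf("rank 0 received: %g %g %g %g\n",
               buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}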

3 Method

AFNI contains thousands of source code files; our focus is on csfft.c. Inside this file, we concentrate on the function csfft_cox(), which performs the FFT and is called by many other functions.

3.1 General Flow Chart of csfft_cox

Fig. 6. Flow chart of csfft_cox

As the flow chart shows, csfft_cox() dispatches on the FFT length: lengths 2 through 512 are handled by the hard-coded routines fft2(), fft4(), fft8(), fft16(), fft32(), fft64(), fft128(), fft256(), and fft512(); lengths 1024, 2048, 4096, 8192, 16384, and 32768 are handled by fft1024() through fft32768(), which go through fft_4dec(); lengths with factors of the form 3n and 5n go through fft_3dec() and fft_5dec(). The result is scaled (SCLINV) before csfft_cox() returns.

3.2 Analysis of the fftn functions

We analyze the hard-coded part of five fftn functions, i.e. fft2(), fft4(), fft8(), fft16(), and fft32(), and translate the original code into the parallelized version. Here we take fft4() as an example; the real parts are computed as:

xcx[0].r = (xcx[0].r + xcx[2].r) + (xcx[1].r + xcx[3].r)
xcx[2].r = (xcx[0].r + xcx[2].r) - (xcx[1].r + xcx[3].r)
xcx[1].r = (xcx[0].r - xcx[2].r) - (xcx[1].r - xcx[3].r) * csp[2].i
xcx[3].r = (xcx[0].r - xcx[2].r) + (xcx[1].r - xcx[3].r) * csp[2].i

For the other hard-coded functions, the reader can refer to Appendix A.

3.3 One-level parallelization

There are several options for parallelizing the csfft_cox() function. At present, we adopt the one-level parallelization method, that is, we parallelize the points where fft4096() calls fft1024() and where fft8192() calls fft2048(), as sketched below.
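The mathematics behind this split is that a length-N transform can be assembled from four independent length-N/4 transforms of the subsequences x[4j], x[4j+1], x[4j+2], x[4j+3]; the four sub-transforms can proceed on different MPI ranks before a recombination step on the head node. The sketch below shows the textbook radix-4 recombination (our reconstruction of the idea; AFNI's actual fft_4dec() differs in organization, data layout, and scaling):

#include <complex.h>
#include <math.h>

/* Radix-4 recombination sketch: given the length-Q transforms A, B, C, D
 * (Q = N/4) of the four interleaved subsequences of x, assemble the
 * length-N transform X.  The quarter-length offsets contribute factors
 * of (-i)^m for m = 0..3.  Illustration only. */
static void radix4_combine(int N,
                           const double complex *A, const double complex *B,
                           const double complex *C, const double complex *D,
                           double complex *X)
{
    int Q = N / 4;
    for (int k = 0; k < Q; k++) {
        double complex w1 = cexp(-2.0 * M_PI * I * k / N);  /* W_N^k       */
        double complex b = w1 * B[k];                       /* W_N^k  B[k] */
        double complex c = w1 * w1 * C[k];                  /* W_N^2k C[k] */
        double complex d = w1 * w1 * w1 * D[k];             /* W_N^3k D[k] */
        X[k]       = A[k] +     b + c +     d;
        X[k + Q]   = A[k] - I * b - c + I * d;
        X[k + 2*Q] = A[k] -     b + c -     d;
        X[k + 3*Q] = A[k] + I * b - c - I * d;
    }
}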

fftest.c is a source code file written solely for the purpose of testing FFT speed. It calculates the CPU time, the wall clock time (elapsed time), the bytes of data processed, and the MFLOP rate. First of all, we set up the MPI initialization calls in the main function of fftest.c:

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &np);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

Then, in the csfft_cox() function, we handle the 4096 and 8192 cases (for the details, refer to Appendix B). Finally, back in the main function of fftest.c, we terminate MPI:

MPI_Finalize();
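For orientation, a timing harness in the spirit of fftest.c might look like the sketch below. This is hypothetical: we assume the csfft_cox(mode, idim, data) signature used above, we take mode = -1 to mean a forward transform (an assumption on our part), and we measure CPU time with clock() and elapsed time with MPI_Wtime().

#include <stdio.h>
#include <time.h>
#include <mpi.h>

typedef struct { float r, i; } complex;       /* AFNI-style complex type */
extern void csfft_cox(int mode, int idim, complex *xc);

/* Hypothetical timing loop: run `iterations` transforms of length
 * `idim` and report CPU time versus wall-clock (elapsed) time. */
static void time_fft(int idim, int iterations, complex *buf)
{
    clock_t c0 = clock();                     /* CPU time, this process */
    double  w0 = MPI_Wtime();                 /* wall-clock time        */

    for (int it = 0; it < iterations; it++)
        csfft_cox(-1, idim, buf);             /* -1 = forward (assumed) */

    double cpu  = (double)(clock() - c0) / CLOCKS_PER_SEC;
    double wall = MPI_Wtime() - w0;
    printf("length %d: CPU %.3f s, elapsed %.3f s\n", idim, cpu, wall);
}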

4 Experiment Results

The following charts show some of our experiment results. The first row of each chart lists three parameters: the FFT length, the number of iterations, and the dimension of the vector (how many FFTs per iteration).

[Tables: CPU Time, System Time, Elapsed Time, and Average CPU Time for FFT length 4096, at four iteration counts with one FFT per iteration, comparing the original code, 2 processors, and 4 processors.]

[Comprehensive Chart 1 for FFT 4096: System Time / Total CPU Time and System Time / Rank 0 CPU Time, per iteration count, for the original code, 2 processors, and 4 processors.]

[Comprehensive Chart 2 for FFT 4096: Total CPU Time / Elapsed Time and Rank 0 CPU Time / Elapsed Time, per iteration count, for the original code, 2 processors, and 4 processors.]

[Comprehensive Chart 3 for FFT 4096: Sequential CPU Time / Average CPU Time (except for rank 0) and Sequential CPU Time / Rank 0 CPU Time, per iteration count, for 2 and 4 processors.]

[Tables: CPU Time, System Time, Elapsed Time, and Average CPU Time for FFT length 8192, at the same four iteration counts with one FFT per iteration, comparing the original code, 2 processors, and 4 processors.]

[Comprehensive Charts 1-3 for FFT 8192: the same six ratios as for FFT 4096, namely System Time / Total CPU Time, System Time / Rank 0 CPU Time, Total CPU Time / Elapsed Time, Rank 0 CPU Time / Elapsed Time, Sequential CPU Time / Average CPU Time (except for rank 0), and Sequential CPU Time / Rank 0 CPU Time.]

5 Discussion

5.1 The correctness of our code

In our driver program, fftest.c, which is used to test the speed of the FFT, we read data from a file containing 2,048,000 complex numbers. After applying the FFT and then the inverse FFT, we obtain a set of complex numbers that are almost identical to those in the original data file; the small differences arise because both the real and imaginary parts of these complex numbers are floating-point values. This result shows that our parallelized version of the FFT code functions correctly.
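Such a round-trip check is easy to express in code. The sketch below (ours, not the paper's fftest.c) reuses the fft_recursive() sketch from Section 2.2 together with the identity IFFT(x) = conj(FFT(conj(x)))/N; the tolerance is illustrative:

#include <complex.h>
#include <math.h>

/* Round-trip check sketch: forward FFT, then inverse FFT via the
 * conjugation identity, then compare against the saved input.
 * Returns 1 if every sample matches within a small tolerance. */
static int fft_roundtrip_ok(int N, double complex *v)
{
    double complex orig[N];
    for (int k = 0; k < N; k++) orig[k] = v[k];

    fft_recursive(N, v);                        /* forward transform    */
    for (int k = 0; k < N; k++) v[k] = conj(v[k]);
    fft_recursive(N, v);                        /* FFT of the conjugate */
    for (int k = 0; k < N; k++) v[k] = conj(v[k]) / N;  /* inverse FFT  */

    for (int k = 0; k < N; k++)
        if (cabs(v[k] - orig[k]) > 1e-9 * N)    /* illustrative bound   */
            return 0;
    return 1;
}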

5.2 One-level parallelization versus multi-level parallelization

In the original sequential code, the calculation has many data dependencies among the instructions. We analyzed the code and worked out the formula for each element; since there is then no interlinking among different elements, this forms the basis of the parallelization. In our first attempt at a parallelized implementation, we tried to handle the most basic functions, i.e. the five fftn functions fft2, fft4, fft8, fft16, and fft32. Our original thought was that these are the functions called by other functions hundreds (maybe thousands) of times. However, the experiments showed that the cost of communication among the participating nodes, together with other overhead, kills the speedup obtained by the parallelization itself. Therefore, we did not parallelize those basic functions. Instead, we shifted our focus to a higher level, namely the points where fft4096() calls fft1024() and where fft8192() calls fft2048(). Each time, we use 4 processors to distribute the calculation workload. Our implementation also does not adopt a multi-level technique. The advantage of multi-level parallelization is obvious and straightforward: it parallelizes the sequential code to the greatest possible extent. The disadvantage is that this kind of parallelization greatly increases the difficulty of implementation, and the communication cost and other overhead cannot be neglected. In future work, we might consider a two-level parallelization technique.

5.3 Analysis of speedup

We have taken two kinds of time into account in analyzing our experimental results: CPU time and wall clock time (elapsed time). The former is the time spent in the calculation part of the code; the latter is the total elapsed time from the viewpoint of the user. It would be ideal for both kinds of time to decrease. However, elapsed time is much more non-deterministic than CPU time and is largely beyond the control of our code. Furthermore, different strategies for assigning tasks to processors produce different results. We tried two methods of distributing the workload: one distributes it over two machines (nodes), the other over four. Both methods assign the workload evenly among the nodes. In addition, before each run of our program, we check that the nodes on which our code runs have a CPU occupancy of less than 20 percent. From our experimental data, we find that with both distribution strategies, CPU time decreases to varying degrees. With the data set 4096/20,000/1 on four machines, which means an FFT length of 4096, 20,000 iterations, and one FFT per iteration, we obtain the maximum speedup in CPU time, a factor of around 2.62. However, here we only consider the average CPU time of the nodes other than the head node. The CPU time on the head node occupies a certain fraction of the wall clock time and is always a little more than the sequential CPU time. We believe there are overhead problems on the head node, because that node distributes the original data and gathers the results for further processing such as recombination. Possible reasons include a networking bottleneck, an inefficient way of sending and receiving messages among the nodes, and perhaps too many iterations as well. As for wall clock time, both distribution methods increased it. Moreover, the total CPU time is only 31 to 37 percent of the wall clock time for an FFT length of 4096, and 47 to 52 percent for a length of 8192. Most of the wall clock time is therefore spent in the system itself, and we are still working to find the exact reason(s). With neither strategy did we obtain the ideal speedup of 2 (on two nodes) or 4 (on four nodes). The main reasons are as follows. First, different users compete for the same CPU. Each time, we choose the machines least occupied by other users to run our code, but we have no control over those CPUs the way an operating system does.
It can therefore happen that, while our process is running on one CPU, that CPU is chosen by other users and becomes more occupied than it appeared to be. Second, given the communication cost and other overhead, it is impossible to obtain the ideal speedup on real machines. In fact, the speedup increases with the size of our data set; however, it is not true that the speedup keeps increasing as we feed more input data to the nodes. We believe the reason is that as the running time increases, the risk of the selected nodes being occupied by other processes also increases. There is therefore an optimal data-set size that yields the largest speedup; in our experiments, the parameters 4096/20,000/1 running on four nodes were the best.
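To make the comparison explicit (this is our reading of the charts, not a formula stated in the original): the speedup quoted here is S = T_seq / T_par, where T_seq is the CPU time of the original sequential code and T_par is the average per-node CPU time of the parallel version, excluding rank 0. On four nodes the ideal value is S = 4, so the best observed S of about 2.62 corresponds to a parallel efficiency of roughly E = S/4, about 0.66.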

6 Conclusions

We have parallelized the FFT part of the AFNI software package. Our method is based on MPI. The results show that, for the FFT algorithm itself, we obtain a speedup of around 2.62 in CPU time. Future work includes parallelizing the workhorse analysis tool in AFNI, the 3dDeconvolve program, which carries out multiple linear regression on voxel time series and computes the associated statistics. Beyond that, our final goal remains a nearly real-time AFNI system.

References

1. Jean Talairach and Pierre Tournoux: Co-Planar Stereotaxic Atlas of the Human Brain. Thieme Medical Publishers, New York (1988)
2. Peter S. Pacheco: Parallel Programming with MPI. Morgan Kaufmann Publishers, San Francisco, California (1997)
3. Quinn: Parallel Algorithm, Chapter 8: The Fast Fourier Transform. Morgan Kaufmann Publishers, San Francisco, California (1997)

Appendix A:

! fft8

xcx[0].r = (xcx[0].r + xcx[4].r) + (xcx[2].r + xcx[6].r) + [(xcx[1].r + xcx[5].r) + (xcx[3].r + xcx[7].r)]
xcx[4].r = (xcx[0].r + xcx[4].r) + (xcx[2].r + xcx[6].r) - [(xcx[1].r + xcx[5].r) + (xcx[3].r + xcx[7].r)]
xcx[1].r = (xcx[0].r - xcx[4].r) + csp[2].i * (xcx[2].r - xcx[6].r) + csp[4].i * [(xcx[1].r - xcx[5].r) + csp[2].i * (xcx[3].r - xcx[7].r)]
xcx[5].r = (xcx[0].r - xcx[4].r) + csp[2].i * (xcx[2].r - xcx[6].r) - csp[4].i * [(xcx[1].r - xcx[5].r) + csp[2].i * (xcx[3].r - xcx[7].r)]
xcx[2].r = (xcx[0].r + xcx[4].r) - (xcx[2].r + xcx[6].r) + csp[5].i * [(xcx[1].r + xcx[5].r) - (xcx[3].r + xcx[7].r)]
xcx[6].r = (xcx[0].r + xcx[4].r) - (xcx[2].r + xcx[6].r) - csp[5].i * [(xcx[1].r + xcx[5].r) - (xcx[3].r + xcx[7].r)]
xcx[3].r = (xcx[0].r - xcx[4].r) - csp[2].i * (xcx[2].r - xcx[6].r) + csp[6].i * [(xcx[1].r - xcx[5].r) - csp[2].i * (xcx[3].r - xcx[7].r)]
xcx[7].r = (xcx[0].r - xcx[4].r) - csp[2].i * (xcx[2].r - xcx[6].r) - csp[6].i * [(xcx[1].r - xcx[5].r) - csp[2].i * (xcx[3].r - xcx[7].r)]

! fft16

xcx[0].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] + [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] + [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] + [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
xcx[8].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] + [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] - [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] - [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
xcx[1].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] + csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r + xcx[14].r)] + csp[8].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[8].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[9].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] + csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r + xcx[14].r)] - csp[8].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[8].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[2].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] + csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] + csp[9].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] + csp[9].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
xcx[10].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] + csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] - csp[9].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] - csp[9].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
xcx[3].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] + csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] + csp[10].i * [- csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[10].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[11].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] + csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] - csp[10].i * [- csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[10].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[4].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] - [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] + csp[11].i * [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] - csp[11].i * [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
xcx[12].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] - [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] - csp[11].i * [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] + csp[11].i * [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]

xcx[5].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] - csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r + xcx[14].r)] + csp[12].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[12].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[13].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] - csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r + xcx[14].r)] - csp[12].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[12].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[6].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] - csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] + csp[13].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] - csp[13].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
xcx[14].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] - csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] - csp[13].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] + csp[13].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
xcx[7].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] - csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] + csp[14].i * [- csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[14].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[15].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] - csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] - csp[14].i * [- csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[14].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]

Appendix B:

case 4096:   // modified by Huang and Xi

    // calculate array aa
    if (rank == 0)
        fft_4dec(mode, 1024, aa);

    // calculate arrays bb, cc and dd, then split them into two subarrays each

    if (rank == 1) {
        fft_4dec(mode, 1024, bb);
        for (i = 0; i < M; i++) {
            bb_r[i] = bb[i].r;
            bb_i[i] = bb[i].i;
        }
    }

    if (rank == 2) {
        fft_4dec(mode, 1024, cc);
        for (i = 0; i < M; i++) {
            cc_r[i] = cc[i].r;
            cc_i[i] = cc[i].i;
        }
    }

    if (rank == 3) {
        fft_4dec(mode, 1024, dd);
        for (i = 0; i < M; i++) {
            dd_r[i] = dd[i].r;
            dd_i[i] = dd[i].i;
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);

    // send and receive the arrays
    if (rank == 1) {
        MPI_Send(bb_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(bb_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    if (rank == 2) {
        MPI_Send(cc_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(cc_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    if (rank == 3) {
        MPI_Send(dd_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(dd_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        MPI_Recv(bb_r, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(bb_i, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(cc_r, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(cc_i, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(dd_r, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(dd_i, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);

        // recombine bb, cc and dd
        for (i = 0; i < M; i++) {
            bb[i].r = bb_r[i];  bb[i].i = bb_i[i];
            cc[i].r = cc_r[i];  cc[i].i = cc_i[i];
            dd[i].r = dd_r[i];  dd[i].i = dd_i[i];
        }
    }
    break;

case 8192:   // modified by Huang and Xi

    // calculate array aa
    if (rank == 0)
        fft_4dec(mode, 2048, aa);

    // calculate arrays bb, cc and dd, then split them into two subarrays each
    if (rank == 1) {
        fft_4dec(mode, 2048, bb);
        for (i = 0; i < M; i++) {
            bb_r[i] = bb[i].r;
            bb_i[i] = bb[i].i;
        }
    }

    if (rank == 2) {
        fft_4dec(mode, 2048, cc);
        for (i = 0; i < M; i++) {
            cc_r[i] = cc[i].r;
            cc_i[i] = cc[i].i;
        }
    }

    if (rank == 3) {
        fft_4dec(mode, 2048, dd);
        for (i = 0; i < M; i++) {
            dd_r[i] = dd[i].r;
            dd_i[i] = dd[i].i;
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);

    // send and receive the arrays
    if (rank == 1) {
        MPI_Send(bb_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(bb_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    if (rank == 2) {
        MPI_Send(cc_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(cc_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    if (rank == 3) {
        MPI_Send(dd_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(dd_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        MPI_Recv(bb_r, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(bb_i, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(cc_r, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(cc_i, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(dd_r, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(dd_i, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);

        // recombine bb, cc and dd
        for (i = 0; i < M; i++) {
            bb[i].r = bb_r[i];  bb[i].i = bb_i[i];
            cc[i].r = cc_r[i];  cc[i].i = cc_i[i];
            dd[i].r = dd_r[i];  dd[i].i = dd_i[i];
        }
    }
    break;


More information

C for Engineers and Scientists: An Interpretive Approach. Chapter 10: Arrays

C for Engineers and Scientists: An Interpretive Approach. Chapter 10: Arrays Chapter 10: Arrays 10.1 Declaration of Arrays 10.2 How arrays are stored in memory One dimensional (1D) array type name[expr]; type is a data type, e.g. int, char, float name is a valid identifier (cannot

More information

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared

More information

Improving Http-Server Performance by Adapted Multithreading

Improving Http-Server Performance by Adapted Multithreading Improving Http-Server Performance by Adapted Multithreading Jörg Keller LG Technische Informatik II FernUniversität Hagen 58084 Hagen, Germany email: joerg.keller@fernuni-hagen.de Olaf Monien Thilo Lardon

More information

A Parallel Evolutionary Algorithm for Discovery of Decision Rules

A Parallel Evolutionary Algorithm for Discovery of Decision Rules A Parallel Evolutionary Algorithm for Discovery of Decision Rules Wojciech Kwedlo Faculty of Computer Science Technical University of Bia lystok Wiejska 45a, 15-351 Bia lystok, Poland wkwedlo@ii.pb.bialystok.pl

More information

Fixed Point Streaming Fft Processor For Ofdm

Fixed Point Streaming Fft Processor For Ofdm Fixed Point Streaming Fft Processor For Ofdm Sudhir Kumar Sa Rashmi Panda Aradhana Raju Abstract Fast Fourier Transform (FFT) processors are today one of the most important blocks in communication systems.

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Parallel FFT Program Optimizations on Heterogeneous Computers

Parallel FFT Program Optimizations on Heterogeneous Computers Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid

More information

White Paper Taking Advantage of Advances in FPGA Floating-Point IP Cores

White Paper Taking Advantage of Advances in FPGA Floating-Point IP Cores White Paper Recently available FPGA design tools and IP provide a substantial reduction in computational resources, as well as greatly easing the implementation effort in a floating-point datapath. Moreover,

More information

Scalability of Heterogeneous Computing

Scalability of Heterogeneous Computing Scalability of Heterogeneous Computing Xian-He Sun, Yong Chen, Ming u Department of Computer Science Illinois Institute of Technology {sun, chenyon1, wuming}@iit.edu Abstract Scalability is a key factor

More information

Parallel Computing and the MPI environment

Parallel Computing and the MPI environment Parallel Computing and the MPI environment Claudio Chiaruttini Dipartimento di Matematica e Informatica Centro Interdipartimentale per le Scienze Computazionali (CISC) Università di Trieste http://www.dmi.units.it/~chiarutt/didattica/parallela

More information

Introduction to High-Performance Computing

Introduction to High-Performance Computing Introduction to High-Performance Computing Simon D. Levy BIOL 274 17 November 2010 Chapter 12 12.1: Concurrent Processing High-Performance Computing A fancy term for computers significantly faster than

More information

Towards Breast Anatomy Simulation Using GPUs

Towards Breast Anatomy Simulation Using GPUs Towards Breast Anatomy Simulation Using GPUs Joseph H. Chui 1, David D. Pokrajac 2, Andrew D.A. Maidment 3, and Predrag R. Bakic 4 1 Department of Radiology, University of Pennsylvania, Philadelphia PA

More information