Parallelization of FFT in AFNI
Huang, Jingshan and Xi, Hong
Department of Computer Science and Engineering, University of South Carolina, Columbia, SC

Abstract. AFNI is a widely used software package for medical image processing. However, it is not a real-time system. We are working toward a parallelized version of AFNI that will be nearly real-time. As the first step of this large task, we parallelize the FFT part of AFNI. Our method is based on MPI. The results show that for the FFT algorithm itself, we obtain a speedup of around 2.62 in CPU time.

1 Introduction

AFNI is a widely used software package for medical image processing. However, the system has a significant drawback: it is not a real-time system, which to some extent has limited its application in related areas. Our ultimate goal is to build a parallelized version of AFNI that is nearly real-time. As a first step, we parallelize the FFT part of AFNI. One reason for choosing FFT is that, inside AFNI, FFT is called extensively by other functions; it is a fundamental function in AFNI. Therefore, a parallel version of FFT in AFNI is a significant step toward our final goal.

2 Background

2.1 AFNI

AFNI stands for Analysis of Functional NeuroImages. It is a set of C programs (over 1,000 source code files) for processing, analyzing, and displaying functional MRI (FMRI) data - a technique for mapping human brain activity. AFNI runs on Unix+X11+Motif systems, including SGI, Solaris, Linux, and Mac OS X. AFNI is an interactive program for viewing the results of 3D functional neuroimaging. It can overlay the (usually) low-resolution results of functional brain scans onto higher-resolution structural volume data sets. By marking fiducial points, one may transform the data to the proportional grid (stereotaxic coordinates) of Talairach and Tournoux [1].
Time-dependent 3D volume data sets can also be created and viewed. In addition, some auxiliary programs are provided for combining and editing 3D and 3D+time functional data sets. Although widely used in medical image processing, AFNI is not a real-time software package, which to some degree limits its application in some cases.

2.2 FFT [3]

The Fast Fourier Transform (FFT) is an algorithm for computing the discrete Fourier transform (DFT), taking data from the discrete time domain to the discrete frequency domain. It is simply a way of organizing the computation that is much faster for large values of N, where N is the number of samples in the sequence: an ingenious way of achieving O(N log2 N) timing rather than the DFT's clumsy O(N^2).

To carry out the FFT, we work backwards, starting with the 2-point transform:

V[k] = W_2^{0k} v[0] + W_2^{k} v[1],   k = 0, 1,

with the two components being:

V[0] = W_2^0 v[0] + W_2^0 v[1] = v[0] + v[1]
V[1] = W_2^0 v[0] + W_2^1 v[1] = v[0] + W_2 v[1]

Here V is a column vector that stores the FFT of the input array v, and W_2 is the principal 2nd root of unity, equal to e^{-pi i} = -1. In the following text, we use W_N to denote the principal Nth root of unity, W_N = e^{-2 pi i / N}. We can represent the two equations for the components of the 2-point transform graphically using the so-called butterfly.

Fig. 1. Butterfly calculation

Furthermore, using the divide-and-conquer strategy, a 4-point transform can be reduced to two 2-point transforms: one for the even elements and one for the odd elements, with the odd one multiplied by W_4^k. Diagrammatically, this can be represented as two levels of butterflies. Notice that using the identity W_{N/2}^n = W_N^{2n}, we can always express all the multipliers as powers of the same W_N (in this case we choose N = 4).

Fig. 2. Diagrammatical representation of the 4-point Fourier transform calculation

In fact, all the butterflies have a similar form:

Fig. 3. Generic butterfly graph

This graph can be further simplified using the identity W_N^{s+N/2} = W_N^s W_N^{N/2} = -W_N^s, which holds because

W_N^{N/2} = e^{-2 pi i (N/2)/N} = e^{-pi i} = cos(-pi) + i sin(-pi) = -1.

Fig. 4. Simplified generic butterfly

Using this result, we can simplify the 4-point diagram:

Fig. 5. 4-point FFT calculation

This diagram is the essence of the FFT algorithm. The main trick is that we do not calculate each component of the Fourier transform separately; that would involve unnecessary repetition of a substantial number of calculations. Instead, we do the calculations in stages. At each stage we start with N (in general complex) numbers and "butterfly" them to obtain a new set of N complex numbers. Those numbers, in turn, become the input for the next stage. The calculation of a 4-point FFT involves two stages: the input of the first stage is the 4 original samples, and the output of the second stage is the 4 components of the Fourier transform. Notice that each stage involves N/2 complex multiplications (or N real multiplications), N/2 sign inversions (multiplication by -1), and N complex additions. So each stage can be done in O(N) time. The number of stages is log2 N (which, since N is a power of 2, is the exponent m in N = 2^m). Altogether, the FFT requires on the order of O(N log2 N) calculations. The complexity follows from the recurrence:

T(N) = 2 T(N/2) + cN
     = 2 [2 T(N/4) + cN/2] + cN = 4 T(N/4) + 2cN
     = ...
     = 2^k T(N/2^k) + k cN.

When 2^k = N, T(N/2^k) = T(1) and k = log2 N, so T(N) = N T(1) + cN log2 N = O(N log2 N).

Moreover, the calculations can be done in place, using a single buffer of N complex numbers. The trick is to initialize this buffer with appropriately scrambled samples. For N = 4, the order of samples is v[0], v[2], v[1], v[3]. In general, according to our basic identity, we first divide the samples into two groups, the even ones and the odd ones.
Applying this division recursively, we split these groups of samples into two groups each by selecting every other sample.
In summary, the naive implementation of the N-point digital Fourier transform involves calculating the scalar product of the sample buffer (treated as an N-dimensional vector) with N separate basis vectors. Since each scalar product involves N multiplications and N additions, the total time is proportional to N^2 (in other words, it is an O(N^2) algorithm). However, it turns out that by cleverly rearranging these operations, one can optimize the algorithm down to O(N log2 N), which for large N makes a huge difference. The idea behind the FFT is the standard strategy for speeding up an algorithm, divide and conquer: break the original N-point sample into two (N/2)-point sequences, because a series of smaller problems is easier to solve than one large one. The DFT requires (N-1)^2 complex multiplications and N(N-1) complex additions, whereas the FFT breaks the computation down into a series of 2-point transforms, each requiring only 1 multiplication and 2 additions, plus a minimal recombination cost.

2.3 FFT in AFNI

Inside AFNI, many functions use the FFT algorithm to perform the Fourier transform. In fact, FFT is a very fundamental function in AFNI. As our first step in parallelizing AFNI, we implement a parallelized version of FFT here.

2.4 MPI

MPI, the Message-Passing Interface [2], is the most widely used approach to developing a parallel system. Rather than specifying a new language (and hence a new compiler), it specifies a library of functions that can be called from a C or Fortran program. The foundation of this library is a small group of functions that can be used to achieve parallelism by message passing. A message-passing function is simply a function that explicitly transmits data from one process to another; message passing is a powerful and very general method of expressing parallelism.
Message-passing programs can be extremely efficient, and message passing is currently the most widely used method of programming many types of parallel computers. Its principal drawback, however, is that it is very difficult to design and develop programs with: it has been called the assembly language of parallel computing because it forces the programmer to deal with so much detail. The introduction of MPI makes it possible for developers of parallel software to write libraries of parallel programs that are both portable and efficient. Use of these libraries hides many of the details of parallel programming and, as a consequence, makes parallel computing much more accessible to professionals in all branches of science and engineering.

3 Method

AFNI contains thousands of source code files; our focus is on csfft.c. Inside this file, we concentrate on the function csfft_cox(), which performs the FFT and is called by many other functions.

3.1 General Flow Chart of csfft_cox
Fig. 6. Flow chart of csfft_cox: depending on the FFT length, csfft_cox() dispatches to one of the hard-coded routines fft2(), fft4(), fft8(), fft16(), fft32(), fft64(), fft128(), fft256(), or fft512(); to fft_4dec() for lengths 1024, 2048, 4096, 8192, 16384, and 32768; or to fft_3dec() and fft_5dec() for lengths of the form 3n and 5n. It then applies SCLINV scaling and returns.

3.2 Analysis of fftn functions

We analyze the hard-coded part of five fftn functions, i.e., fft2(), fft4(), fft8(), fft16(), and fft32(), and translate the original code into a parallelized version. Here we take fft4() as an example:

xcx[0].r = (xcx[0].r + xcx[2].r) + (xcx[1].r + xcx[3].r)
xcx[2].r = (xcx[0].r + xcx[2].r) - (xcx[1].r + xcx[3].r)
xcx[1].r = (xcx[0].r - xcx[2].r) - (xcx[1].r - xcx[3].r) * csp[2].i
xcx[3].r = (xcx[0].r - xcx[2].r) + (xcx[1].r - xcx[3].r) * csp[2].i

For the other hard-coded functions, readers can refer to Appendix A.

3.3 One-level parallelization

There are several options for parallelizing the csfft_cox() function. At present, we adopt the one-level parallelization method, that is, we parallelize the point where fft4096() calls fft1024() and the point where fft8192() calls fft2048().
fftest.c is a source code file written solely for the purpose of testing FFT speed. It measures CPU time, wall clock time (elapsed time), bytes of data processed, and MFLOPs. First of all, we set up the relevant MPI initialization functions in the main function of fftest.c:

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &np);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

Then, in the csfft_cox() function, we handle the 4096 and 8192 cases (for details, refer to Appendix B). Finally, back in the main function of fftest.c, we terminate MPI:

MPI_Finalize();

4 Experiment Results

The following charts show some of our experimental results. The first row of each chart gives three parameters: the FFT length, the number of iterations, and the dimension of the vector (how many FFTs per iteration).

[Tables for FFT length 4096 at iteration counts of 5,000, 10,000, 20,000, and 100,000 (one FFT per iteration): CPU time, system time, elapsed time, and average CPU time for the original code and for the 2-processor and 4-processor runs.]
[Comprehensive Charts 1-3 for FFT 4096, by iteration count (5,000; 10,000; 20,000; 100,000): system time as a fraction of total CPU time and of rank-0 CPU time; total CPU time and rank-0 CPU time as fractions of elapsed time; and sequential CPU time relative to the average CPU time (excluding rank 0) and to rank-0 CPU time, for the original code and the 2- and 4-processor runs.]

[Tables for FFT length 8192 at the same iteration counts: CPU time, system time, elapsed time, and average CPU time for the original code and for the 2-processor and 4-processor runs.]
[Comprehensive Charts 1-3 for FFT 8192, with the same ratio structure as the corresponding charts for FFT 4096.]

5 Discussion

5.1 The correctness of our code

In our driver program fftest.c, which is used to test the speed of FFT, we read data from a file containing 2,048,000 complex numbers. After applying the FFT and then the inverse FFT, we obtain a set of complex numbers that are almost identical to the ones in the original data file; the small differences arise because both the real and imaginary parts of those complex numbers are floating-point numbers. This result shows that our parallelized FFT code functions correctly.

5.2 One-level parallelization versus multi-level parallelization

In the original sequential code, the calculation has many data dependencies among the instructions. We analyzed the code and derived a formula for each element. With no interlinking among different elements, this forms the basis for parallelization. In our first attempt at parallelized code, we tried to handle the most basic functions, i.e., the five fft functions fft2, fft4, fft8, fft16, and fft32. Our original thought was that these functions are the ones called by other functions hundreds (maybe thousands) of times. However, our experiments showed that the cost of communication among all participating nodes, together with other overhead, kills the speedup obtained by the parallelization itself.
Therefore, we did not parallelize those basic functions. Instead, we shifted our focus to a higher level: the point where fft4096() calls fft1024() and the point where fft8192() calls fft2048(). In addition, each time we use 4 processors to distribute the calculation workload. We also did not adopt the multi-level technique in our implementation. The advantage of multi-level parallelization is obvious and straightforward: it parallelizes the sequential code to the greatest possible extent. The disadvantage is that it greatly increases the difficulty of implementation, and its communication cost and other overhead cannot be neglected. In future work, we may consider a two-level parallelization technique.

5.3 Analysis of speedup

We considered two kinds of time in analyzing our experimental results: CPU time and wall clock time (elapsed time). The former is the time spent in the calculation part of the code; the latter is the total elapsed time from the viewpoint of the user. Ideally, both kinds of time would decrease. However, elapsed time is much more non-deterministic than CPU time and is largely outside the control of our code. Furthermore, different strategies for assigning tasks among processors yield different results. We tried two methods of distributing the workload: one distributes it across two machines (nodes), the other across four. Both methods assign the workload evenly among the nodes. In addition, before each run of our program, we check that the nodes on which our code will run have a CPU-occupancy ratio of less than 2 percent. From our experimental data, we find that with both distribution strategies, CPU time decreases to varying degrees.
When we use the data set 4096/20,000/1 on four machines, meaning an FFT length of 4096, 20,000 iterations, and one FFT per iteration, we get the maximum speedup in CPU time, a factor of around 2.62. However, here we only consider the average CPU time of the nodes other than the head node. The CPU time on the head node occupies a certain fraction of the wall clock time and is always slightly more than the sequential CPU time. We believe there are overhead problems on the head node, because it is the machine that distributes the original data and gathers the results for further processing such as recombination. Possible causes include a networking bottleneck, an inefficient way of sending and receiving messages among the nodes, and perhaps too many iterations as well. As for wall clock time, both distribution methods increased it. Moreover, the total CPU time is only 31 to 37 percent of the wall clock time for an FFT length of 4096, and 47 to 52 percent for an FFT length of 8192. It is therefore clear that most of the wall clock time is spent in the system itself, and we are still working to find the exact reason(s). With both strategies, we did not obtain the ideal speedup of 2 or 4 times. The main reasons are as follows. First, there is competition among different users for the same CPU. Each time, we choose the machines least occupied by other users to run our code; however, we have no control over those CPUs the way an operating system does, so while our process is running on one CPU, that CPU may be chosen by other users and become more heavily occupied than it appeared to be. Second, given the communication cost and other overhead, the ideal speedup is impossible to obtain on real machines. In fact, as the size of our data set increases, the speedup increases as well.
However, the speedup does not continue to increase indefinitely as we feed more input data to the nodes. We believe the reason is that as the running time increases, the risk of the selected nodes being occupied by other processes also increases. There is therefore an optimal data-set size that yields the largest speedup; in our experiments, the parameters 4096/20,000/1 running on four nodes were the best.
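The speedup and efficiency figures discussed above follow directly from the time ratios. The sketch below makes the arithmetic explicit; the 2.62 factor is the paper's reported best case, while the raw times used in the usage note are hypothetical placeholders, since the measured table values are not reproduced here.

```c
/* Speedup = sequential CPU time / parallel CPU time;
   efficiency = speedup / number of processors.
   Ideal speedup on p processors is p (efficiency 1.0). */
static double speedup(double t_seq, double t_par)
{
    return t_seq / t_par;
}

static double efficiency(double t_seq, double t_par, int nprocs)
{
    return speedup(t_seq, t_par) / nprocs;
}
```

For example, with a hypothetical sequential CPU time of 2.62 s and a parallel CPU time of 1.00 s on 4 nodes, the speedup is 2.62 and the efficiency is about 0.66, well below the ideal speedup of 4 for the reasons given above.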
6 Conclusions

We have parallelized the FFT part of the AFNI software package. Our method is based on MPI. The results show that for the FFT algorithm itself, we obtain a speedup of around 2.62 in CPU time. Future work includes parallelizing the workhorse analysis tool in AFNI, the 3dDeconvolve program, which carries out multiple linear regression on voxel time series and computes the associated statistics. Our final goal is a near-real-time AFNI system.

References

1. Jean Talairach and Pierre Tournoux: Co-Planar Stereotaxic Atlas of the Human Brain. Thieme Medical Publishers, New York (1988)
2. Peter S. Pacheco: Parallel Programming with MPI. Morgan Kaufmann Publishers, San Francisco, California (1997)
3. Quinn: Parallel Algorithm, Chapter 8: The Fast Fourier Transform. Morgan Kaufmann Publishers, San Francisco, California (1997)

Appendix A:

! fft8
xcx[0].r = (xcx[0].r + xcx[4].r) + (xcx[2].r + xcx[6].r) + [(xcx[1].r + xcx[5].r) + (xcx[3].r + xcx[7].r)]
xcx[4].r = (xcx[0].r + xcx[4].r) + (xcx[2].r + xcx[6].r) - [(xcx[1].r + xcx[5].r) + (xcx[3].r + xcx[7].r)]
xcx[1].r = (xcx[0].r - xcx[4].r) + csp[2].i * (xcx[2].r - xcx[6].r) + csp[4].i * [(xcx[1].r - xcx[5].r) + csp[2].i * (xcx[3].r - xcx[7].r)]
xcx[5].r = (xcx[0].r - xcx[4].r) + csp[2].i * (xcx[2].r - xcx[6].r) - csp[4].i * [(xcx[1].r - xcx[5].r) + csp[2].i * (xcx[3].r - xcx[7].r)]
xcx[2].r = (xcx[0].r + xcx[4].r) - (xcx[2].r + xcx[6].r) + csp[5].i * [(xcx[1].r + xcx[5].r) - (xcx[3].r + xcx[7].r)]
xcx[6].r = (xcx[0].r + xcx[4].r) - (xcx[2].r + xcx[6].r) - csp[5].i * [(xcx[1].r + xcx[5].r) - (xcx[3].r + xcx[7].r)]
xcx[3].r = (xcx[0].r - xcx[4].r) - csp[2].i * (xcx[2].r - xcx[6].r) + csp[6].i * [(xcx[1].r - xcx[5].r) - csp[2].i * (xcx[3].r - xcx[7].r)]
xcx[7].r = (xcx[0].r - xcx[4].r) - csp[2].i * (xcx[2].r - xcx[6].r) - csp[6].i * [(xcx[1].r - xcx[5].r) - csp[2].i * (xcx[3].r - xcx[7].r)]

! fft16
xcx[0].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] + [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] + [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] + [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
xcx[8].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] + [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] - [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] - [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
xcx[1].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] + csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r - xcx[14].r)] + csp[8].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[8].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[9].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] + csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r - xcx[14].r)] - csp[8].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[8].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[2].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] + csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] + csp[9].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] + csp[9].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
xcx[10].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] + csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] - csp[9].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] - csp[9].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
xcx[3].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] + csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] + csp[10].i * [-csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[10].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[11].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] + csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] - csp[10].i * [-csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[10].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[4].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] - [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] + csp[11].i * [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] - csp[11].i * [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
xcx[12].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] - [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] - csp[11].i * [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] + csp[11].i * [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
xcx[5].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] - csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r - xcx[14].r)] + csp[12].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[12].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[13].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] - csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r - xcx[14].r)] - csp[12].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[12].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[6].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] - csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] + csp[13].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] - csp[13].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
xcx[14].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] - csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] - csp[13].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] + csp[13].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
xcx[7].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] - csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] + csp[14].i * [-csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[14].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]
xcx[15].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] - csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] - csp[14].i * [-csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[14].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]

Appendix B:

case 4096:   // modified by Huang and Xi
    // calculate array aa
    if (rank == 0)
        fft_4dec(mode, 1024, aa);

    // calculate arrays bb, cc, and dd, then split each into two
    // subarrays (real and imaginary parts)
    if (rank == 1) {
        fft_4dec(mode, 1024, bb);
        for (i = 0; i < M; i++) {
            bb_r[i] = bb[i].r;
            bb_i[i] = bb[i].i;
        }
    }
    if (rank == 2) {
        fft_4dec(mode, 1024, cc);
        for (i = 0; i < M; i++) {
            cc_r[i] = cc[i].r;
            cc_i[i] = cc[i].i;
        }
    }
    if (rank == 3) {
        fft_4dec(mode, 1024, dd);
        for (i = 0; i < M; i++) {
            dd_r[i] = dd[i].r;
            dd_i[i] = dd[i].i;
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);

    // send and receive the arrays
    if (rank == 1) {
        MPI_Send(bb_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(bb_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
    if (rank == 2) {
        MPI_Send(cc_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(cc_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
    if (rank == 3) {
        MPI_Send(dd_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(dd_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
    if (rank == 0) {
        MPI_Recv(bb_r, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(bb_i, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(cc_r, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(cc_i, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(dd_r, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(dd_i, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);

        // recombine bb, cc, and dd
        for (i = 0; i < M; i++) {
            bb[i].r = bb_r[i];  bb[i].i = bb_i[i];
            cc[i].r = cc_r[i];  cc[i].i = cc_i[i];
            dd[i].r = dd_r[i];  dd[i].i = dd_i[i];
        }
    }
    break;

case 8192:   // modified by Huang and Xi
    // calculate array aa
    if (rank == 0)
        fft_4dec(mode, 2048, aa);

    // calculate arrays bb, cc, and dd, then split each into two
    // subarrays (real and imaginary parts)
    if (rank == 1) {
        fft_4dec(mode, 2048, bb);
        for (i = 0; i < M; i++) {
            bb_r[i] = bb[i].r;
            bb_i[i] = bb[i].i;
        }
    }
    if (rank == 2) {
        fft_4dec(mode, 2048, cc);
        for (i = 0; i < M; i++) {
            cc_r[i] = cc[i].r;
            cc_i[i] = cc[i].i;
        }
    }
    if (rank == 3) {
        fft_4dec(mode, 2048, dd);
        for (i = 0; i < M; i++) {
            dd_r[i] = dd[i].r;
            dd_i[i] = dd[i].i;
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);

    // send and receive the arrays
    if (rank == 1) {
        MPI_Send(bb_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(bb_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
    if (rank == 2) {
        MPI_Send(cc_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(cc_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
    if (rank == 3) {
        MPI_Send(dd_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(dd_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
    if (rank == 0) {
        MPI_Recv(bb_r, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(bb_i, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(cc_r, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(cc_i, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(dd_r, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(dd_i, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);

        // recombine bb, cc, and dd
        for (i = 0; i < M; i++) {
            bb[i].r = bb_r[i];  bb[i].i = bb_i[i];
            cc[i].r = cc_r[i];  cc[i].i = cc_i[i];
            dd[i].r = dd_r[i];  dd[i].i = dd_i[i];
        }
    }
    break;
Module 9 : Numerical Relaying II : DSP Perspective Lecture 36 : Fast Fourier Transform Objectives In this lecture, We will introduce Fast Fourier Transform (FFT). We will show equivalence between FFT and
More informationTwiddle Factor Transformation for Pipelined FFT Processing
Twiddle Factor Transformation for Pipelined FFT Processing In-Cheol Park, WonHee Son, and Ji-Hoon Kim School of EECS, Korea Advanced Institute of Science and Technology, Daejeon, Korea icpark@ee.kaist.ac.kr,
More informationCSE. Parallel Algorithms on a cluster of PCs. Ian Bush. Daresbury Laboratory (With thanks to Lorna Smith and Mark Bull at EPCC)
Parallel Algorithms on a cluster of PCs Ian Bush Daresbury Laboratory I.J.Bush@dl.ac.uk (With thanks to Lorna Smith and Mark Bull at EPCC) Overview This lecture will cover General Message passing concepts
More informationChapter 3. Distributed Memory Programming with MPI
An Introduction to Parallel Programming Peter Pacheco Chapter 3 Distributed Memory Programming with MPI 1 Roadmap n Writing your first MPI program. n Using the common MPI functions. n The Trapezoidal Rule
More informationUnit 9 : Fundamentals of Parallel Processing
Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing
More informationQuiz for Chapter 1 Computer Abstractions and Technology
Date: Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,
More informationAnadarko Public Schools MATH Power Standards
Anadarko Public Schools MATH Power Standards Kindergarten 1. Say the number name sequence forward and backward beginning from a given number within the known sequence (counting on, spiral) 2. Write numbers
More informationOpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.
OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 15, 2010 José Monteiro (DEI / IST) Parallel and Distributed Computing
More informationParallel-computing approach for FFT implementation on digital signal processor (DSP)
Parallel-computing approach for FFT implementation on digital signal processor (DSP) Yi-Pin Hsu and Shin-Yu Lin Abstract An efficient parallel form in digital signal processor can improve the algorithm
More informationDESIGN METHODOLOGY. 5.1 General
87 5 FFT DESIGN METHODOLOGY 5.1 General The fast Fourier transform is used to deliver a fast approach for the processing of data in the wireless transmission. The Fast Fourier Transform is one of the methods
More informationHigh performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli
High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering
More informationB.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2
Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,
More informationITCS 4145/5145 Assignment 2
ITCS 4145/5145 Assignment 2 Compiling and running MPI programs Author: B. Wilkinson and Clayton S. Ferner. Modification date: September 10, 2012 In this assignment, the workpool computations done in Assignment
More informationMessage Passing Interface (MPI)
CS 220: Introduction to Parallel Computing Message Passing Interface (MPI) Lecture 13 Today s Schedule Parallel Computing Background Diving in: MPI The Jetson cluster 3/7/18 CS 220: Parallel Computing
More informationPARALLEL TRAINING OF NEURAL NETWORKS FOR SPEECH RECOGNITION
PARALLEL TRAINING OF NEURAL NETWORKS FOR SPEECH RECOGNITION Stanislav Kontár Speech@FIT, Dept. of Computer Graphics and Multimedia, FIT, BUT, Brno, Czech Republic E-mail: xkonta00@stud.fit.vutbr.cz In
More informationCache-oblivious Programming
Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix
More informationImage Compression System on an FPGA
Image Compression System on an FPGA Group 1 Megan Fuller, Ezzeldin Hamed 6.375 Contents 1 Objective 2 2 Background 2 2.1 The DFT........................................ 3 2.2 The DCT........................................
More informationChapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,
Chapter 1 Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Course Goals Introduce you to design principles, analysis techniques and design options in computer architecture
More information1:21. Down sampling/under sampling. The spectrum has the same shape, but the periodicity is twice as dense.
1:21 Down sampling/under sampling The spectrum has the same shape, but the periodicity is twice as dense. 2:21 SUMMARY 1) The DFT only gives a 100% correct result, if the input sequence is periodic. 2)
More informationCS 426. Building and Running a Parallel Application
CS 426 Building and Running a Parallel Application 1 Task/Channel Model Design Efficient Parallel Programs (or Algorithms) Mainly for distributed memory systems (e.g. Clusters) Break Parallel Computations
More information[4] 1 cycle takes 1/(3x10 9 ) seconds. One access to memory takes 50/(3x10 9 ) seconds. =16ns. Performance = 4 FLOPS / (2x50/(3x10 9 )) = 120 MFLOPS.
Give your answers in the space provided with each question. Answers written elsewhere will not be graded. Q1). [4 points] Consider a memory system with level 1 cache of 64 KB and DRAM of 1GB with processor
More informationOutline. CS38 Introduction to Algorithms. Fast Fourier Transform (FFT) Fast Fourier Transform (FFT) Fast Fourier Transform (FFT)
Outline CS8 Introduction to Algorithms Lecture 9 April 9, 0 Divide and Conquer design paradigm matrix multiplication Dynamic programming design paradigm Fibonacci numbers weighted interval scheduling knapsack
More informationDESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES
DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall
More informationPerformance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of
More informationQuiz for Chapter 1 Computer Abstractions and Technology 3.10
Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: 1. [15 points] Consider two different implementations, M1 and
More informationPipelining and Vector Processing
Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline
More informationStudy and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou
Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm
More informationIntroduction to Medical Imaging. Cone-Beam CT. Klaus Mueller. Computer Science Department Stony Brook University
Introduction to Medical Imaging Cone-Beam CT Klaus Mueller Computer Science Department Stony Brook University Introduction Available cone-beam reconstruction methods: exact approximate algebraic Our discussion:
More informationHigh-Performance and Parallel Computing
9 High-Performance and Parallel Computing 9.1 Code optimization To use resources efficiently, the time saved through optimizing code has to be weighed against the human resources required to implement
More informationComplexity Measures for Map-Reduce, and Comparison to Parallel Computing
Complexity Measures for Map-Reduce, and Comparison to Parallel Computing Ashish Goel Stanford University and Twitter Kamesh Munagala Duke University and Twitter November 11, 2012 The programming paradigm
More informationSolution of Exercise Sheet 2
Solution of Exercise Sheet 2 Exercise 1 (Cluster Computing) 1. Give a short definition of Cluster Computing. Clustering is parallel computing on systems with distributed memory. 2. What is a Cluster of
More informationHPC Parallel Programing Multi-node Computation with MPI - I
HPC Parallel Programing Multi-node Computation with MPI - I Parallelization and Optimization Group TATA Consultancy Services, Sahyadri Park Pune, India TCS all rights reserved April 29, 2013 Copyright
More informationUnit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES
DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall
More informationSDP Memo 048: Two Dimensional Sparse Fourier Transform Algorithms
SDP Memo 048: Two Dimensional Sparse Fourier Transform Algorithms Document Number......................................................... SDP Memo 048 Document Type.....................................................................
More informationAnalysis of Radix- SDF Pipeline FFT Architecture in VLSI Using Chip Scope
Analysis of Radix- SDF Pipeline FFT Architecture in VLSI Using Chip Scope G. Mohana Durga 1, D.V.R. Mohan 2 1 M.Tech Student, 2 Professor, Department of ECE, SRKR Engineering College, Bhimavaram, Andhra
More informationFiber Fourier optics
Final version printed as of 4/7/00 Accepted for publication in Optics Letters Fiber Fourier optics A. E. Siegman Ginzton Laboratory, Stanford University, Stanford, California 94305 Received The Fourier
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationCO Computer Architecture and Programming Languages CAPL. Lecture 15
CO20-320241 Computer Architecture and Programming Languages CAPL Lecture 15 Dr. Kinga Lipskoch Fall 2017 How to Compute a Binary Float Decimal fraction: 8.703125 Integral part: 8 1000 Fraction part: 0.703125
More informationIntroduction to Algorithms
Lecture 1 Introduction to Algorithms 1.1 Overview The purpose of this lecture is to give a brief overview of the topic of Algorithms and the kind of thinking it involves: why we focus on the subjects that
More informationAbstract. Introduction. Kevin Todisco
- Kevin Todisco Figure 1: A large scale example of the simulation. The leftmost image shows the beginning of the test case, and shows how the fluid refracts the environment around it. The middle image
More informationBindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core
Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable
More informationDigital Image Processing. Image Enhancement in the Frequency Domain
Digital Image Processing Image Enhancement in the Frequency Domain Topics Frequency Domain Enhancements Fourier Transform Convolution High Pass Filtering in Frequency Domain Low Pass Filtering in Frequency
More informationParallel Programming Assignment 3 Compiling and running MPI programs
Parallel Programming Assignment 3 Compiling and running MPI programs Author: Clayton S. Ferner and B. Wilkinson Modification date: October 11a, 2013 This assignment uses the UNC-Wilmington cluster babbage.cis.uncw.edu.
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationOpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.
OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 16, 2011 CPD (DEI / IST) Parallel and Distributed Computing 18
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationFractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures
Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin
More informationCITS2401 Computer Analysis & Visualisation
FACULTY OF ENGINEERING, COMPUTING AND MATHEMATICS CITS2401 Computer Analysis & Visualisation SCHOOL OF COMPUTER SCIENCE AND SOFTWARE ENGINEERING Topic 3 Introduction to Matlab Material from MATLAB for
More informationThe Fast Fourier Transform Algorithm and Its Application in Digital Image Processing
The Fast Fourier Transform Algorithm and Its Application in Digital Image Processing S.Arunachalam(Associate Professor) Department of Mathematics, Rizvi College of Arts, Science & Commerce, Bandra (West),
More informationComputer Caches. Lab 1. Caching
Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main
More informationBack-Projection on GPU: Improving the Performance
UNIVERSITY OF MICHIGAN Back-Projection on GPU: Improving the Performance EECS 499 Independent Study Wenlay Esther Wei 4/29/2010 The purpose of this project is to accelerate the processing speed of the
More informationMPI. (message passing, MIMD)
MPI (message passing, MIMD) What is MPI? a message-passing library specification extension of C/C++ (and Fortran) message passing for distributed memory parallel programming Features of MPI Point-to-point
More informationTechnical Report. SLI Best Practices
Technical Report SLI Best Practices Abstract This paper describes techniques that can be used to perform application-side detection of SLI-configured systems, as well as ensure maximum performance scaling
More informationMPI Message Passing Interface
MPI Message Passing Interface Portable Parallel Programs Parallel Computing A problem is broken down into tasks, performed by separate workers or processes Processes interact by exchanging information
More informationTheory Implementation Results Conclusion References. Cloud Computing. Special Task 2 - Parallel Merge Sort with MPI Summer Term 2018
Agrawal, Cocos, Merkl, Santos Summer Term 2018 Cloud Computing 1/19 Cloud Computing Special Task 2 - Parallel Merge Sort with MPI Summer Term 2018 Prachi Agrawal, Henry Cocos, David Merkl, Samuel Santos
More informationMPI Optimisation. Advanced Parallel Programming. David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh
MPI Optimisation Advanced Parallel Programming David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh Overview Can divide overheads up into four main categories: Lack of parallelism Load imbalance
More informationUniversity of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2.
University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2. Solutions Problem 1 Problem 3.12 in CSG (a) Clearly the partitioning can be
More informationMPI: The Message-Passing Interface. Most of this discussion is from [1] and [2].
MPI: The Message-Passing Interface Most of this discussion is from [1] and [2]. What Is MPI? The Message-Passing Interface (MPI) is a standard for expressing distributed parallelism via message passing.
More informationParallel Fast Fourier Transform implementations in Julia 12/15/2011
Parallel Fast Fourier Transform implementations in Julia 1/15/011 Abstract This paper examines the parallel computation models of Julia through several different multiprocessor FFT implementations of 1D
More informationENT 315 Medical Signal Processing CHAPTER 3 FAST FOURIER TRANSFORM. Dr. Lim Chee Chin
ENT 315 Medical Signal Processing CHAPTER 3 FAST FOURIER TRANSFORM Dr. Lim Chee Chin Outline Definition and Introduction FFT Properties of FFT Algorithm of FFT Decimate in Time (DIT) FFT Steps for radix
More informationImplementation of Lifting-Based Two Dimensional Discrete Wavelet Transform on FPGA Using Pipeline Architecture
International Journal of Computer Trends and Technology (IJCTT) volume 5 number 5 Nov 2013 Implementation of Lifting-Based Two Dimensional Discrete Wavelet Transform on FPGA Using Pipeline Architecture
More informationUsing Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh
Using Java for Scientific Computing Mark Bul EPCC, University of Edinburgh markb@epcc.ed.ac.uk Java and Scientific Computing? Benefits of Java for Scientific Computing Portability Network centricity Software
More informationWhy Multiprocessors?
Why Multiprocessors? Motivation: Go beyond the performance offered by a single processor Without requiring specialized processors Without the complexity of too much multiple issue Opportunity: Software
More information1 The range query problem
CS268: Geometric Algorithms Handout #12 Design and Analysis Original Handout #12 Stanford University Thursday, 19 May 1994 Original Lecture #12: Thursday, May 19, 1994 Topics: Range Searching with Partition
More information6.375 Ray Tracing Hardware Accelerator
6.375 Ray Tracing Hardware Accelerator Chun Fai Cheung, Sabrina Neuman, Michael Poon May 13, 2010 Abstract This report describes the design and implementation of a hardware accelerator for software ray
More informationTutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE
Tutorial: Analyzing MPI Applications Intel Trace Analyzer and Collector Intel VTune Amplifier XE Contents Legal Information... 3 1. Overview... 4 1.1. Prerequisites... 5 1.1.1. Required Software... 5 1.1.2.
More informationAnnals of the University of North Carolina Wilmington Master of Science in Computer Science and Information Systems
Annals of the University of North Carolina Wilmington Master of Science in Computer Science and Information Systems Suppressing Independent Loops in Packing/Unpacking Loop Nests to Reduce Message Size
More informationC for Engineers and Scientists: An Interpretive Approach. Chapter 10: Arrays
Chapter 10: Arrays 10.1 Declaration of Arrays 10.2 How arrays are stored in memory One dimensional (1D) array type name[expr]; type is a data type, e.g. int, char, float name is a valid identifier (cannot
More informationMPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016
MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared
More informationImproving Http-Server Performance by Adapted Multithreading
Improving Http-Server Performance by Adapted Multithreading Jörg Keller LG Technische Informatik II FernUniversität Hagen 58084 Hagen, Germany email: joerg.keller@fernuni-hagen.de Olaf Monien Thilo Lardon
More informationA Parallel Evolutionary Algorithm for Discovery of Decision Rules
A Parallel Evolutionary Algorithm for Discovery of Decision Rules Wojciech Kwedlo Faculty of Computer Science Technical University of Bia lystok Wiejska 45a, 15-351 Bia lystok, Poland wkwedlo@ii.pb.bialystok.pl
More informationFixed Point Streaming Fft Processor For Ofdm
Fixed Point Streaming Fft Processor For Ofdm Sudhir Kumar Sa Rashmi Panda Aradhana Raju Abstract Fast Fourier Transform (FFT) processors are today one of the most important blocks in communication systems.
More informationChapter 5A. Large and Fast: Exploiting Memory Hierarchy
Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationWhite Paper Taking Advantage of Advances in FPGA Floating-Point IP Cores
White Paper Recently available FPGA design tools and IP provide a substantial reduction in computational resources, as well as greatly easing the implementation effort in a floating-point datapath. Moreover,
More informationScalability of Heterogeneous Computing
Scalability of Heterogeneous Computing Xian-He Sun, Yong Chen, Ming u Department of Computer Science Illinois Institute of Technology {sun, chenyon1, wuming}@iit.edu Abstract Scalability is a key factor
More informationParallel Computing and the MPI environment
Parallel Computing and the MPI environment Claudio Chiaruttini Dipartimento di Matematica e Informatica Centro Interdipartimentale per le Scienze Computazionali (CISC) Università di Trieste http://www.dmi.units.it/~chiarutt/didattica/parallela
More informationIntroduction to High-Performance Computing
Introduction to High-Performance Computing Simon D. Levy BIOL 274 17 November 2010 Chapter 12 12.1: Concurrent Processing High-Performance Computing A fancy term for computers significantly faster than
More informationTowards Breast Anatomy Simulation Using GPUs
Towards Breast Anatomy Simulation Using GPUs Joseph H. Chui 1, David D. Pokrajac 2, Andrew D.A. Maidment 3, and Predrag R. Bakic 4 1 Department of Radiology, University of Pennsylvania, Philadelphia PA
More information