Analysis of the Parallelisation of the Duchamp Algorithm


Stefan Westerlund, University of Western Australia

Abstract

A critical step in radio astronomy is to search images to determine the objects they contain. New telescope installations, such as the Murchison Widefield Array (MWA) and the Australian Square Kilometer Array Pathfinder (ASKAP), are capable of observing the sky at higher resolution than previous telescopes. This increased resolution results in a much greater data output, so increased computing power is required in order to search this data for objects. The Square Kilometer Array (SKA) will produce even more data, and require even more computational power to search its output. A parallel application is required to make use of this computing performance. The goal of this project is to examine the source finder program Duchamp to determine how it will perform in a parallel implementation, and to estimate potential combinations of hardware to be used to run this parallel implementation. This is done by calculating the arithmetic intensity of Duchamp and matching it to the arithmetic intensity of potential hardware. This comparison is performed using a black box model, to determine the overall performance of the computing system and its bandwidth. A node model is also considered, to determine the number, performance, memory and interconnect bandwidth of the individual nodes that comprise the parallel computer system. The results of this project suggest two potential computer systems. One consists of 392 nodes, each with an Intel Core i7 975 processor, at least 14.1GB of RAM and a network capable of providing a connection of at least 1.56GB/s (12.5Gbit/s) of bandwidth to each node. The second uses 67 nodes powered by nvidia Tesla C2070 GPUs, with at least 82.1GB of RAM per node and a network that can provide at least 4.32GB/s (34.6Gbit/s) of bandwidth to each node. Both systems should use a 100 Gigabit Ethernet network to transfer data to and from the system. Other configurations considered had memory-per-node requirements that exceed currently available commodity hardware.

1. Introduction

Modern telescope installations, such as the Square Kilometer Array (SKA) and the Australian Square Kilometer Array Pathfinder (ASKAP), search a much larger area of sky in a given amount of time than current telescopes. The SKA will produce data cubes that are terabytes in size. An all-sky survey will produce thousands of such image cubes, resulting in a data set that is petabytes in size. All of this data will need to be searched to find the objects it contains. The problem is that the computational requirements to search these images dwarf what is available from desktop machines. Instead, supercomputers are needed to process these images in a reasonable amount of time. Therefore, the source finder programs used for searching the large astronomy images must be able to run in parallel, so they can make use of the computational power of current, parallel supercomputers. This project uses the Duchamp program as a representative example of a source finder, in order to examine the implications of running a source finder on a parallel computer network.

The goal of this project is to examine the Duchamp source finding program and determine the effects of parallelising it. This is done by examining how many operations Duchamp requires and how much data transfer is required in order to search an area of sky. These values will be used to determine an appropriate combination of hardware to run the parallel version of Duchamp. This project will consider the hardware both as a computer system as a whole and as a network of nodes.

The Background section will describe the knowledge required to understand this report. The manner in which the Duchamp program will be analysed to understand how it will perform with a parallel implementation will be detailed in the Methodology section, along with how estimates for potential hardware configurations for this problem will be obtained. The results of evaluating the models from the Methodology section are shown in the Results section. The implications of these results will be considered and the selection of hardware will be made in the Discussion section. The conclusions of this work are made, along with a discussion of the limitations of this project and the future work to be done to expand on this project, in the Conclusion section.

2. Background

This section will first provide a broad introduction to radio astronomy and the role of source finders. It will then detail Duchamp, the source finder chosen to be investigated in this project. Also described is the à trous image reconstruction algorithm, which comprises the majority of the computational requirements of the Duchamp program. The concepts of arithmetic intensity and computational complexity will be explained, as they are relevant to the understanding of this work.

2.1. Radio Astronomy

Radio astronomy is the study of celestial objects by examining electromagnetic radiation in the radio spectrum. It is possible to examine this radiation from Earth because it is in one of the frequency windows that is not blocked by Earth's atmosphere. Radio waves are beneficial to study because they pass through objects that are opaque to visible light, such as dust clouds. Radio astronomy also allows astronomers to observe objects that do not emit visible light, such as hydrogen clouds, as neutral hydrogen produces 21cm radiation [1].

Radio waves are often detected using arrays of telescopes because the signals can be combined between telescopes in a process called radio interferometry. Using multiple telescopes improves the angular resolution of the system, such that the angular resolution of two telescopes a certain distance apart is the same as that of a single telescope with a dish diameter equal to that distance. It takes a significant amount of computational power to process the signals received by telescopes into astronomy images. Signals from different telescopes are correlated together, taking account of their relative positions. The results are integrated over time, causing noise to cancel out towards zero and allowing fainter signals to be detected. This information is combined into a structure called a data cube. This cube has three dimensions: two are spatial dimensions that denote where in the sky an element lies, and the third is the frequency channel that a particular element represents.

Once these data cubes have been created, they need to be searched. Source finders are programs that are used to find sources of electromagnetic radiation in an image. The quality of a source finder is measured in terms of its completeness and reliability. Completeness is a measure of how many of the actual sources in the data cube the source finder finds. Reliability is the proportion of the objects a source finder finds that are actual sources, rather than noise.

Several source finders were considered for this report. These were MultiFind and TopHat, which were used by the HIPASS survey [2], and the Duchamp source finder [3]. Neither MultiFind nor TopHat was chosen, because of requirements of completeness and reliability. In the HIPASS survey, people were used to confirm each of the sources found by these programs. MultiFind found around 83% and TopHat found around 90% of the sources that were deemed to be in the data, but each one found sources the other didn't. Additionally, MultiFind and TopHat found 137,060 and 17,232 sources respectively, compared to 4,315 sources in the final count [2]. The data from newer surveys, such as those from ASKAP, will have too many sources to be verified by people.

Duchamp [3] is a new source finder program written by Dr. Matthew Whiting. Still under development, Duchamp uses a different algorithm. Because of this, Duchamp will be used as an estimate of how much computing power a source finder will need, and of the effects of parallelising it. The detection algorithm used by Duchamp is to consider all elements above a certain threshold as bright elements. Adjacent bright elements are then joined together as objects. Objects that are below another threshold in size are discarded as noise, rather than actual sources. Duchamp uses pre-processing of the data cube to reduce the noise of the data cube, and to allow it to see fainter objects.
The pre-processing uses image reconstruction with the à trous method [4], which is explained in detail in the next section.

2.2. The À Trous Image Reconstruction Algorithm

The à trous image reconstruction algorithm is a three-dimensional wavelet transform [5]. Through successive three-dimensional low-pass filtering, it considers the image at several scales. The filtered values at each scale are added to the output only if they are still greater than a threshold. The flowchart for the algorithm is shown in Figure 1. The algorithm is described in more detail in the following paragraphs.

First the algorithm loads the original data cube as the input, and the values of the output data cube are initialised to zero. The data cube is operated on over several iterations of the outer loop, with the stopping criterion dependent on the change in MAD (Median Absolute Deviation) from one iteration to the next, and a minimum of two iterations. For each of these iterations, the data cube values are first set to the original input minus the current output, and the scale is initialised to one. The wavelet values are calculated by convolving the data cube with a low-pass filter and subtracting the filtered values from the values of the data cube. The distance between the elements used in the filter is dependent on the scale. A threshold is calculated from the median of the wavelet values. The wavelet values that are greater than the threshold are added to the output. The data cube is then updated by subtracting the wavelet values from the current data cube values. The inner loop is repeated, incrementing the scale at each iteration, for a number of scales proportional to the logarithm of the shortest side length of the data cube. Once all the scales have been completed, the final filtered values are added to the output, without regard to a threshold. The stopping condition is then checked to see if another iteration of the outer loop should be performed. Once all the iterations are complete, the output data cube is returned.

The exact operations performed by the algorithm are described in the next paragraph. Starting with the original data cube as the input, the data cube is convolved with a discrete filter. Consider x, y and z as the coordinates of an element in the image cube, and let d_{s,l} be the real-valued data cube that comprises the image at scale s and iteration l. α is the original input data cube and β_{l-1} is the output at the end of iteration l-1 and the start of iteration l. W = \lfloor l_f / 2 \rfloor, where l_f is the one-dimensional length of the filter used, and f[i][j][k] are the coefficients of the three-dimensional filter. Then the values of the data cube are updated from one scale and iteration to the next according to the following equations:

d_{1,l}[x][y][z] = \alpha[x][y][z] - \beta_{l-1}[x][y][z]

d_{s+1,l}[x][y][z] = \sum_{i=-W}^{W} \sum_{j=-W}^{W} \sum_{k=-W}^{W} f[i][j][k]\, d_{s,l}[x + 2^{s-1} i][y + 2^{s-1} j][z + 2^{s-1} k]    (1)

This data access pattern is demonstrated, in two dimensions, in Figure 2. If a required element is outside of the cube, then a reflected element is used instead. The number of scales, S, is dependent on the smallest side length of the data cube being examined. If the length of the shortest side is l_min, then the number of scales is S = \log_2(l_{min}) - 1.
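To make the filtering step of Equation (1) concrete, here is a minimal sketch of one à trous smoothing pass, reduced to one dimension for readability (the real algorithm works on the full three-dimensional cube). The B3-spline filter coefficients and the boundary reflection shown here are illustrative assumptions, not necessarily Duchamp's exact implementation.

```python
import numpy as np

def atrous_smooth_1d(data, scale, coeffs=(1/16, 1/4, 3/8, 1/4, 1/16)):
    """One smoothing pass of the a trous filter at the given scale.

    Illustrative 1D sketch of Equation (1): element x of the smoothed
    array is a weighted sum of elements x + 2**(scale-1) * i for
    i = -W..W, with reflection at the boundaries.
    """
    w = len(coeffs) // 2                 # filter half-width W
    spacing = 2 ** (scale - 1)           # gap between filter taps doubles each scale
    n = len(data)
    smoothed = np.zeros_like(data)
    for x in range(n):
        total = 0.0
        for i, c in enumerate(coeffs):
            idx = x + (i - w) * spacing
            # reflect indices that fall outside the array
            if idx < 0:
                idx = -idx
            elif idx >= n:
                idx = 2 * (n - 1) - idx
            total += c * data[idx]
        smoothed[x] = total
    return smoothed

# The wavelet values at a scale are the difference between the data
# before and after one smoothing pass.
data = np.random.randn(256).astype(np.float32)
smoothed = atrous_smooth_1d(data, scale=1)
wavelet = data - smoothed
```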

Figure 1: À Trous Image Reconstruction Flowchart. This flowchart shows the working of the à trous image reconstruction algorithm used by Duchamp. First it loads the original image as the input, and the values of the output data cube are initialised to zero. The image is operated on over several iterations of the outer loop, with the stopping criterion dependent on the change in MAD (Median Absolute Deviation) from one iteration to the next, and a minimum of two iterations. For each of these iterations, the data cube is convolved with a filter over several scales. At each scale, data cube values that are above a threshold are added to the output.

Figure 2: Duchamp Data Access Pattern. This diagram shows the values needed to calculate the next value of a given element. A two-dimensional data set is used instead of a three-dimensional one for clarity. Likewise, only the elements required for the first three scales are shown. The element marked X is the element whose next value is being calculated. The numbers indicate in which scale the surrounding elements are used. Elements that are used in two different scales still need to be read twice, as the values of the surrounding pixels will also have changed from one scale to the next. Note that the distance from the target element to the surrounding elements doubles with each scale. Also, each coloured element will require the black element for its own calculation at that scale.

The wavelet coefficients, w_{s,l}, at scale s and iteration l are equal to the difference between the data cube values at successive scales:

w_{s,l}[x][y][z] = d_{s,l}[x][y][z] - d_{s+1,l}[x][y][z]    (2)

The wavelet coefficients are then added to the output array if and only if they are a certain threshold, t[s], above the median, m[w_{s,l}]. This threshold is a constant based on the current scale, s. The increase in the output as a result of scale s in iteration l, β_{s,l}, is therefore calculated according to the following equation:

\beta_{s,l}[x][y][z] = \begin{cases} w_{s,l}[x][y][z] & \text{if } w_{s,l}[x][y][z] > m[w_{s,l}] + t[s] \\ 0 & \text{otherwise} \end{cases}    (3)

The threshold is a constant value determined at the start of the program. It is dependent on the scale, and is multiplied by a value given in the program parameters. The output is calculated for S scales. The final filtered values for the data are then added to the output, so the total output for iteration l of the algorithm, β_l, is given according to the following equation:

\beta_l[x][y][z] = \beta_{l-1}[x][y][z] + d_{S+1,l}[x][y][z] + \sum_{s=1}^{S} \beta_{s,l}[x][y][z], \qquad \beta_0[x][y][z] = 0    (4)

The output after each iteration is calculated until the difference in the median absolute deviation from one iteration to the next is small enough. If M[x] is the median absolute deviation of x and τ is the tolerance, then the stopping condition is evaluated according to the equation:

\frac{M[\alpha - \beta_l] - M[\alpha - \beta_{l-1}]}{M[\alpha - \beta_l]} < \tau    (5)

The tolerance is specified in the input parameters for the program. With the default settings, the algorithm usually takes three or four iterations.
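Putting Equations (2) to (5) together, the outer loop of the reconstruction can be sketched as follows. This is again a simplified one-dimensional sketch: it reuses the assumed atrous_smooth_1d helper from the earlier example, and the threshold form (a multiple of the MAD of the wavelet coefficients) and default constants are illustrative assumptions, not Duchamp's exact parameters.

```python
import numpy as np

def mad(x):
    """Median absolute deviation, the statistic used in the stopping test."""
    return np.median(np.abs(x - np.median(x)))

def reconstruct(alpha, n_scales, threshold_scale=3.0, tol=0.005, min_iters=2):
    """Simplified 1D sketch of the outer reconstruction loop (Eqs 2-5).

    alpha is the input array; the output beta accumulates wavelet values
    that exceed the scale-dependent threshold t[s].
    """
    beta = np.zeros_like(alpha)
    prev_mad = None
    iteration = 0
    while True:
        iteration += 1
        d = alpha - beta                          # residual at start of iteration l
        for s in range(1, n_scales + 1):
            smoothed = atrous_smooth_1d(d, s)     # helper from the previous sketch
            w_s = d - smoothed                    # Equation (2)
            t_s = threshold_scale * mad(w_s)      # scale-dependent threshold (assumed form)
            beta += np.where(w_s > np.median(w_s) + t_s, w_s, 0.0)  # Equation (3)
            d = smoothed                          # update data for the next scale
        beta += d                                 # final smoothed values added (Eq 4)
        cur_mad = mad(alpha - beta)
        if prev_mad is not None and iteration >= min_iters:
            if abs(cur_mad - prev_mad) / cur_mad < tol:   # Equation (5)
                return beta
        prev_mad = cur_mad
```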

2.3. Arithmetic Intensity

The measure used to match the algorithm to potential hardware is arithmetic intensity. This value compares computation to data transfer. The arithmetic intensity of an algorithm is the number of operations it requires per byte of data transferred. The arithmetic intensity of a computer system is its operational performance divided by its bandwidth. If the arithmetic intensity of the algorithm is greater than that of the system, then the problem is computationally bound, and excess bandwidth will be unused. Conversely, if the arithmetic intensity of the algorithm is less than that of the computer system, the algorithm is bandwidth bound, and excess computational performance will be unused. This metric can therefore be used to match an algorithm with suitable hardware.

Equation 6 denotes how the arithmetic intensity of an algorithm and of a computer system are calculated. a_a is the arithmetic intensity of the algorithm and a_c is the arithmetic intensity of the computer system. p is the number of operations required for the algorithm, usually counted as FLOPs, or Floating Point OPerations. r is the number of bytes the algorithm needs to transfer. c is the computational power of the computer, in FLOP/s, Floating Point OPerations per second. b is the bandwidth of the computer system. If a_a > a_c then the problem is computationally bound, and excess bandwidth will be unused. If a_a < a_c then the algorithm is bandwidth bound, and extra computational performance will not be used. This is how an algorithm can be matched to an appropriate computer system. The number of operations required, and the related computational complexity, is discussed in the next section.

a_a = p / r, \qquad a_c = c / b    (6)

2.4. Computational Complexity and Operation Counts

Computational complexity is a measure of how much effort is required to run an algorithm. It is often written as an upper bound, in big O notation. A function f(n) is of order O(g(n)) if Equation 7 holds for some constant k.

\lim_{n \to \infty} f(n) \le k\, g(n)    (7)

The computational complexity of the à trous image reconstruction method used by Duchamp is O(VSL), where V is the number of elements in the image, S is the number of scales and L is the number of iterations of the outer loop required. This can be used to estimate the running time of the program, based on the running time of the program with different input. If t is the running time of the program with parameters V, S and L, and t_0 is the running time measured using parameters V_0, S_0 and L_0, then the running time can be estimated using Equation 8.

t \approx t_0 \frac{V S L}{V_0 S_0 L_0}    (8)

Examining the source code also allows the operation counts to be determined. The filtering portion of the image reconstruction algorithm requires 2180VSL single-precision floating point operations and 250VSL double-precision floating point operations. Calculating the median requires 48V \log_{10}(V)(S+2)L single-precision floating point operations. This analysis considers the median algorithm to be a single-threaded implementation of introsort [6], followed by picking the middle element, as a worst-case scenario. Parallel, and more efficient, implementations exist, for example Bader, 2004 [7]. The single- and double-precision operations will be combined to derive an equivalent number of single-precision floating point operations. This will be done by considering a double-precision

operation to be equivalent to two single-precision operations in the case of CPUs [8], two single-precision operations for nvidia GPUs [9] and five single-precision operations in the case of AMD GPUs [10]. This is because different processors perform double-precision floating point operations at different speeds relative to how fast they can perform single-precision floating point operations. These are the operation counts that will be used in calculating the arithmetic intensity of the image reconstruction algorithm.

3. Methodology

This section will describe how the Duchamp source finder program will be analysed. It will explain how the data cubes that are the input to Duchamp will be considered. Two models of the computing environment in which Duchamp will be run will be considered: a black box model and a node model. The two measures that will be applied to Duchamp, arithmetic intensity and computational complexity, were explained in the previous section.

This report will consider a data cube with two spatial dimensions, X and Y, and a frequency dimension, F. This results in an image cube having XYF elements. Each element has D single-precision values, where D is greater than or equal to one. These values specify different properties of the element, including one value denoting the brightness of that element. Therefore, with a single-precision floating point number requiring four bytes of storage, the total file size for the data cube is 4XYFD bytes. Because Duchamp only uses the brightness value for an element, only this value will be considered when determining how much memory Duchamp needs to store all its data.

3.1. Black Box Model

The first model is a black box model, as shown in Figure 3. This considers the computer system as a black box, with a certain computational rate and a bandwidth that determines the rate at which data is moved on and off the system. This model compares the total number of floating point operations required by the algorithm to the amount of data transfer needed to move the input data cube onto the system and the output data catalogue off the system. This model will help determine the overall performance of the potential system.

This model gives particular values to use to calculate the arithmetic intensities, as shown in Equation 6. p is the number of floating point operations required, as given in Section 2.4. The value of r is equal to the file size of the image cube, 4XYFD, plus the file size of the catalogue, in bytes. Although only one value per element is used by Duchamp, this report will consider all D values per element being transferred to the computer system, as a worst-case situation. The value c is equal to the total computational performance of the computer system, in FLOP/s, and b is the bandwidth of the communication link that moves data on and off the system.
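As a rough sketch of how the black box numbers are produced, the following Python assembles the operation counts from Section 2.4 and the transfer size 4XYFD into the arithmetic intensity of Equation (6), then asks how fast the whole system must be to keep pace with a given link. The ASKAP-like cube dimensions and the network speeds are the figures quoted later in the report; the helper names and the printout are assumptions for illustration.

```python
import math

def flop_counts(V, S, L):
    """Single- and double-precision FLOPs for the reconstruction (Section 2.4)."""
    filter_sp = 2180 * V * S * L
    filter_dp = 250 * V * S * L
    median_sp = 48 * V * math.log10(V) * (S + 2) * L
    return filter_sp + median_sp, filter_dp

def equivalent_sp_flops(sp, dp, dp_cost=2):
    """Fold double-precision work into equivalent single-precision FLOPs.
    dp_cost is 2 for CPUs and nvidia GPUs, 5 for AMD GPUs."""
    return sp + dp_cost * dp

# Black box model for an ASKAP-sized cube (values quoted in the Results section)
X, Y, F, D = 4096, 4096, 16384, 5
S, L = 10, 4
V = X * Y * F

sp, dp = flop_counts(V, S, L)
p = equivalent_sp_flops(sp, dp)          # operations, CPU-style weighting
r = 4 * X * Y * F * D                    # bytes moved onto the system; catalogue is negligible
a_algorithm = p / r                      # Equation (6): FLOPs per byte

# Required whole-system performance for each candidate link (bytes per second)
networks = {"Gigabit Ethernet": 0.125e9, "10 Gigabit Ethernet": 1.25e9,
            "InfiniBand QDR 4X": 4.0e9, "100 Gigabit Ethernet": 12.5e9}
for name, b in networks.items():
    print(f"{name}: {a_algorithm * b / 1e12:.1f} TFLOP/s to match the link")
```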

3.2. Node Model

The second model is the node model, as shown in Figure 4. This concerns the computation once all the data is on the system. It considers a series of nodes, each capable of a certain computational performance. These are connected by an interconnect that has a certain bandwidth from one node to another. This model compares the operations required by the algorithm against the data it needs to transfer. It also considers the memory requirements of the system. Examining the source code shows that the image reconstruction algorithm uses five single-precision values in memory for each element in the data cube, so each node requires 5 × 4 = 20 bytes of RAM for each element it holds, in order to store all the required data in memory.

In determining the arithmetic intensity, the node model uses the same number of floating point operations as the black box model for p. The computational performance of the computer system, c, is the computational performance of a single node. The system bandwidth, b, is the bandwidth of the interconnect from one node to another. The amount of data transfer required, r, is more complicated, as it depends on the number of nodes used, and how the elements in the data cube are distributed between the nodes. The amount of data transfer required is calculated in the following paragraphs.

This analysis will consider each node to hold an m × m × m cube of elements from the data cube, and to be responsible for performing the operations required for these elements. If there are a total of v elements in the data cube, then the number of nodes used is n = v / m^3, so m = \sqrt[3]{v / n}. For a filter with length 2w + 1, for some positive integer w, and at each scale s, to calculate the wavelet coefficient for a given element the node holding that element needs the values of the elements 2^{s-1}, 2 \cdot 2^{s-1}, 3 \cdot 2^{s-1}, \ldots, w \cdot 2^{s-1} values away, on either side, in each dimension. This requires that each node in the computing network store not only the data cube values for the elements it is responsible for computing, but also the elements that surround these in the data cube. These extra values, which are not operated on by a node but are used in calculations for the elements on that node, are called a halo.

If a node needs elements up to d values away in each of the three dimensions, and a node works on a cube of elements with side length m, then the elements a node in the computer network needs form a cube with side length m + 2d elements. The amount of data that needs to be transferred to a node is equal to the volume of this cube minus the size of the node's own cube, as the node already holds its own values. Therefore, the amount of data transfer needed per node, d_n, is:

d_n = (m + 2d)^3 - m^3 = 6m^2 d + 12 m d^2 + 8 d^3    (9)

The total amount of data needing transfer from one node to another is equal to the data transfer needed per node multiplied by the number of nodes. Thus the number of elements being transferred per pass of the filter, d_f, is equal to:

d_f = n d_n = (6m^2 d + 12 m d^2 + 8 d^3) n = 6 n m^2 d + 12 n m d^2 + 8 n d^3    (10)
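A short sketch of Equations (9) and (10), together with the per-node memory figure above; the function names and the example node count are assumptions, and the halo depth d is passed in directly (d = w·2^(s-1) at scale s).

```python
def halo_transfer_per_pass(v, n, d):
    """Elements moved between nodes for one filter pass, Equations (9) and (10).

    v: total elements in the cube, n: number of nodes,
    d: halo depth in elements.  Each node is assumed to hold an
    m x m x m block with m = (v / n) ** (1/3)."""
    m = (v / n) ** (1.0 / 3.0)
    per_node = (m + 2 * d) ** 3 - m ** 3      # Equation (9)
    return n * per_node                        # Equation (10)

def ram_per_node_bytes(v, n):
    """Five single-precision (4-byte) values per element held on a node."""
    return 5 * 4 * (v / n)

# Example with assumed figures: a 1e9-element cube split over 64 nodes,
# halo depth 2 (scale 1 with a length-5 filter)
print(halo_transfer_per_pass(1e9, 64, 2), "elements moved;",
      ram_per_node_bytes(1e9, 64) / 1e9, "GB RAM per node")
```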

Figure 3: Black Box Model. This figure shows the black box model. It compares the computation required to complete an algorithm on the given input against the data transfer required to move the input onto the computer system and to move the output from the system. This can be used to determine a potential performance for the computer system, when using a given data transfer technology.

Figure 4: Node Model. This figure shows the node model. It considers a number of computing nodes that are each capable of a certain computational performance, and are connected together with an interconnect of a certain bandwidth. Each node holds part of the data cube, and works with the other nodes to calculate the result. This model can be used to match the algorithm to a given computational performance, bandwidth, number of nodes in the network, and memory per node.

Substituting in m = \sqrt[3]{v / n} gives the amount of transfer per pass of the filter as a function of the total data cube size and the number of nodes:

d_f = 6 n^{1/3} v^{2/3} d + 12 n^{2/3} v^{1/3} d^2 + 8 n d^3    (11)

Thus the amount of data transfer needed increases with the number of nodes. The maximum number of elements stored on a given node is limited by the amount of memory that node has, divided by the amount of memory it needs per element to be able to operate on that element.

For this algorithm, the halo distance is dependent on the scale: d = w 2^{s-1}, for s = 1 to S. The amount of data transferred at scale s, d_s, is:

d_s = 6 n^{1/3} v^{2/3} w 2^{s-1} + 12 n^{2/3} v^{1/3} w^2 2^{2s-2} + 8 n w^3 2^{3s-3}    (12)

The amount of information transferred for each iteration of the main loop, d_l, is therefore:

d_l = \sum_{s=1}^{S} \left( 6 n^{1/3} v^{2/3} w 2^{s-1} + 12 n^{2/3} v^{1/3} w^2 2^{2s-2} + 8 n w^3 2^{3s-3} \right)    (13)

And so the total amount of information needing transfer for L iterations of the main loop, d_t, is:

d_t = L \sum_{s=1}^{S} \left( 6 n^{1/3} v^{2/3} w 2^{s-1} + 12 n^{2/3} v^{1/3} w^2 2^{2s-2} + 8 n w^3 2^{3s-3} \right)    (14)

The amount of data transfer required for a given scale reaches a maximum when all the data needed at that scale comes from elements that are stored in a different node from the centre element. When this happens, the data transfer for that scale stays constant as the number of nodes increases. This is because each node then only needs data from a limited number of distant nodes, rather than all the values within a certain distance in the data cube. The data access patterns between nodes can be seen in Figures 5 and 6.

This calculation overestimates the amount of data transfer because it fails to account for the edge cases of the data cube. When an element near the edge of the data cube needs the value of an element that is outside the data cube, a reflected value is used instead. This reflected element may lie in the current node, or in data that has already been loaded from another node, meaning that the reflected value does not need to be loaded itself. The portion of the data that does not need to be loaded because of this increases as the number of elements per node increases relative to the total number of elements. Thus, the overestimation is greatest when the fewest nodes are used. This is why the amount of data transfer does not decrease to zero when the number of nodes is one.
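Equations (12) to (14) can be evaluated directly by summing the per-pass transfer over scales and iterations. The sketch below does this by reusing the assumed halo_transfer_per_pass helper from the earlier example; like the equations themselves, it ignores edge reflection and the per-scale transfer cap, so it overestimates for small node counts.

```python
def total_filter_transfer_elems(v, n, w, S, L):
    """Equations (12)-(14): elements moved between nodes for L iterations
    over S scales, with halo depth d = w * 2**(s-1) at scale s."""
    per_iteration = sum(halo_transfer_per_pass(v, n, w * 2 ** (s - 1))
                        for s in range(1, S + 1))
    return L * per_iteration

# Example with assumed figures: an ASKAP-sized cube on 1000 nodes,
# length-5 filter (w = 2), S = 10 scales, L = 4 iterations
v = 4096 * 4096 * 16384
elems = total_filter_transfer_elems(v, n=1000, w=2, S=10, L=4)
print(f"{elems * 4 / 1e12:.1f} TB moved between nodes (4 bytes per element)")
```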

Figure 5: Duchamp Node Data Access Pattern. This diagram shows the elements a node needs from other nodes, for different scales. A two-dimensional data set is shown instead of a three-dimensional data set for clarity. This example uses a node size value of m = 4, a filter of length 5, so w = 2, and shows four scales, S = 4. The boundaries between nodes are shown by the thick black lines. The black elements are the elements whose wavelet values are being calculated by a given node. The first scale uses the blue elements. The second scale uses the blue and red elements. The third scale uses the blue, red, green and yellow elements. The fourth scale requires the yellow and brown elements. Note how for the first three scales, all the elements within a certain distance around the node are needed, but at the fourth scale only certain blocks of values are needed, with gaps in between. In this example, the data transfer between nodes reaches a maximum at the fourth scale, and remains the same for higher scales.

Figure 6: Duchamp Node Data Access Pattern. This diagram shows the elements a node needs to calculate the wavelet coefficients for its own elements, when the node size is not a power of two. This example uses a node size value of m = 5, a filter of length 5, so w = 2, and shows four scales, S = 4. The first scale uses the blue elements. The second scale uses the blue, red and orange elements. The third scale uses the blue, red, orange, yellow and green elements. The fourth scale uses the orange, green and brown elements. Note that the elements needed by a particular node do not align with the elements other nodes hold.

The data transfer for the median requires one transfer of the image each time the median is calculated. The amount of data transfer per calculation of the median is d_m = V. The median is calculated once per scale and twice at the end of each iteration, so the total number of elements needing transfer for calculation of the median, d_{m,t}, is:

d_{m,t} = d_m (S + 2) L = V (S + 2) L    (15)

4. Results

There are a number of steps in analysing how a parallel implementation of Duchamp will perform. First the input to Duchamp will be defined, for use in the remainder of the testing. Preliminary testing was performed to examine the single-threaded implementation, to determine its running time and most computationally intensive methods. Duchamp was then analysed according to the black box model, to match the computational requirements of the entire computer system with the speed of the connection used to transfer data on and off the system. Duchamp was then considered using the node model. This model relates the number of nodes, the computational speed and memory of each node, and the speed of the interconnect between nodes.

There are two data cubes that will be considered in this analysis. The first is a data cube of the Virgo cluster, made from data from the HIPASS survey. This cube has spatial dimensions of X × Y = , and F = 256 frequency channels. From these dimensions, the number of scales for this cube, S, will be six. This cube only has one value per element, so D = 1, and the file size is 116MB. This data cube will be used to test the Duchamp program. The second cube is a hypothetical cube that may be produced by ASKAP, to be used to estimate what hardware a computer system would need to process such a cube. This cube has spatial dimensions of X × Y = 4,096 × 4,096 and F = 16,384 frequency channels. This results in a number of scales, S, of ten. The ASKAP cube may have D = 5 values per element and a file size of 5.5TB. Of the five values, four are the Stokes parameters that determine the polarisation of the electromagnetic radiation, including the brightness, and the fifth is a weighting that measures how exposed that element was over the time the data cube was produced.

As a preliminary test, Duchamp was first run using the HIPASS cube as input. This was to provide an estimate of the time needed, and to check which methods used the majority of the computing time. This test was run on a system with a dual-core AMD 1.8GHz Opteron 265 processor with 4GB of DDR2 memory. The filter chosen for this test, and for the ASKAP data cube, was of length five, so w = 2. Running the Duchamp program on this system with the HIPASS cube as input took 30 minutes. This required three iterations of the outermost loop. This report will estimate that the ASKAP cube will require four iterations of the outermost loop. Using these values, an estimate can be made of how long Duchamp will run when processing an ASKAP cube. Using Equation 8, the estimate of the time taken, T, is:

T \approx \frac{(VSL)_{ASKAP}}{(VSL)_{HIPASS}} \times 30\ \text{minutes} \approx 440\ \text{days}    (16)

This test produces a catalogue output that is 70kB in size. This shows that, for the black box model, the size of the output can be ignored because it is negligible compared to the size of the input. Duchamp was profiled with the gprof program in order to determine which method calls take the greatest portion of the running time. Analysis of the operation counts, in Section 2.4, suggests that the à trous image reconstruction algorithm and the calculation of the median comprise the majority of the computational requirements of Duchamp. Executing Duchamp with the HIPASS data cube shows that the à trous image reconstruction algorithm takes 95% of the running time, including 17% for calculating the median. This confirms that these are the most time consuming parts of the Duchamp program.

The number of operations required is a function of the cube, and is independent of the number of processors used. For the filtering, the HIPASS cube requires 1.23 single-precision TFLOPs and 130 double-precision GFLOPs. For calculating the median, it requires 248 single-precision GFLOPs. The ASKAP cube requires 24 single-precision PFLOPs and 2.75 double-precision PFLOPs for filtering, and 7.24 single-precision PFLOPs for calculating the median.

The black box arithmetic intensity can be calculated from these values. How the black box arithmetic intensity varies with the size of the data cube is shown in Figure 7. The black box arithmetic intensity of the HIPASS and ASKAP cubes is compared to different network technologies in Figure 8. The technologies shown are Gigabit Ethernet at 125MB/s, 10 Gigabit Ethernet at 1.25GB/s, the proposed 100 Gigabit Ethernet at 12.5GB/s [11] and InfiniBand QDR 4X at 4.00GB/s [12].

The node model arithmetic intensity of the algorithm changes with the number of nodes used, as the amount of data transfer varies. The amount of data transfer required is shown in Figure 9. Comparing this with the number of operations required, the arithmetic intensity can be calculated. The arithmetic intensity of the HIPASS and ASKAP cubes, as the number of nodes varies, is shown in Figures 11 and 12, respectively. Comparing these values against available hardware links a potential combination of hardware to the optimum number of nodes, as shown in Figure 13. The processors shown are an Intel Core i7 975, with a single-precision performance of 213 GFLOP/s [13], an nvidia Tesla C2070 with a single-precision performance of 1.26 TFLOP/s [14] and an AMD Radeon HD 5970 with a single-precision performance of 4.64 TFLOP/s [10]. The interconnects used are 10 Gigabit Ethernet at 2.50GB/s [11], InfiniBand QDR 4X at 8.00GB/s [12], PCI Express v2 x16 at 16.0GB/s [15], and the proposed 100 Gigabit Ethernet at 25.0GB/s [11]. These figures are twice the one-way bandwidth, because the Duchamp algorithm can benefit from transferring information in both directions with full-duplex interconnects. How the RAM requirements for each node vary with the number of nodes is shown in Figure 14.
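The node-model curves of Figures 11 to 13 can be approximated with the helpers defined in the earlier sketches. The fragment below computes the node-model arithmetic intensity for a given node count, then scans for the largest node count a given processor-and-interconnect pairing can still feed, which is the matching idea behind Figure 13. It ignores the median-related transfer of Equation (15) and uses a plain linear scan, both simplifying assumptions.

```python
def node_intensity(v, n, w, S, L, dp_cost=2):
    """Node-model arithmetic intensity: FLOPs per byte moved between nodes."""
    sp, dp = flop_counts(v, S, L)                         # from the earlier sketch
    p = equivalent_sp_flops(sp, dp, dp_cost)
    r = 4 * total_filter_transfer_elems(v, n, w, S, L)    # bytes between nodes
    return p / r

def optimum_nodes(v, w, S, L, proc_flops, link_bytes_per_s, dp_cost=2, n_max=100000):
    """Largest n whose algorithm intensity still meets the hardware's c/b."""
    target = proc_flops / link_bytes_per_s
    best = 1
    for n in range(1, n_max + 1):
        if node_intensity(v, n, w, S, L, dp_cost) >= target:
            best = n
        else:
            break                                          # intensity only falls as n grows
    return best

# Example with an assumed pairing: Core i7 975 (213 GFLOP/s) and InfiniBand QDR 4X (8 GB/s)
v = 4096 * 4096 * 16384
print(optimum_nodes(v, w=2, S=10, L=4, proc_flops=213e9, link_bytes_per_s=8e9))
```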

Figure 7: Duchamp Black Box Computational Intensity. This graph shows how the approximate computational intensity of the Duchamp program varies with the number of elements in the data cube. The intensity is measured as the number of combined single- and double-precision FLOPs required per byte of data transferred onto the computer system. This graph assumes that each element in the data cube has five single-precision floating point values associated with it, and that four iterations of the main loop are performed. Because of this, the HIPASS cube shown here shows less arithmetic intensity than it does in practice. This graph was made to determine how the balance of computation and data transfer varies for different data cube sizes.

Figure 8: Duchamp Black Box Technological Requirements. This graph compares the floating point performance of the computer system required to keep up with a given bandwidth that transfers data onto the computer system. The x-axis is the bandwidth used to transfer data onto the system, in bytes per second, and the y-axis is the floating point performance, in floating point operations per second. Also shown are several common network technologies and their bandwidths. This graph is made in order to match the connection bandwidth to the overall computational performance of the computer system.

Figure 9: Duchamp Data Transfer. This figure shows how the amount of data needing transfer from one node to another varies with the number of nodes. The x-axis is the number of nodes in the computer system and the y-axis is the amount of data, measured in terabytes. This graph calculates the amount of data transfer as the number of nodes in the system varies, to be used to calculate the arithmetic intensity of the image reconstruction algorithm when it is run on different numbers of nodes.

Figure 10: Duchamp Data Transfer for Large Numbers of Nodes. This diagram shows how the data transfer varies for large numbers of nodes. The x-axis is the number of nodes. Each line starts at one node and ends at a number of nodes equal to the number of elements in that data cube. The y-axis is the amount of data transferred from one node to another, in terabytes. This graph shows the data transfer required for greater numbers of nodes in order to show the effects of the image reconstruction algorithm reaching the maximum data transfer for a given scale.

Figure 11: HIPASS Computational Intensity. This graph shows the computational intensity of the Duchamp 3D image reconstruction algorithm when run on the HIPASS data cube. This arithmetic intensity can be used to match the algorithm to suitable hardware. The first line shows the single-precision arithmetic intensity. The second line shows the double-precision arithmetic intensity. The third line shows the computational intensity of single- and double-precision operations together, counting one double-precision operation as two single-precision operations. The single- and double-precision operations are shown separately because they are performed at different speeds by CPUs and GPUs.

Figure 12: ASKAP Computational Intensity. This graph shows the computational intensity of the Duchamp 3D image reconstruction algorithm when run on the ASKAP data cube. This arithmetic intensity can be used to match the algorithm to suitable hardware. The first line shows the single-precision arithmetic intensity. The second line shows the double-precision arithmetic intensity. The third line shows the computational intensity of single- and double-precision operations together, counting one double-precision operation as two single-precision operations. The single- and double-precision operations are shown separately because they are performed at different speeds by CPUs and GPUs.

Figure 13: Duchamp Optimum Number of Nodes. This graph shows the optimum number of nodes to use for the Duchamp algorithm on an ASKAP data cube. The number of nodes is a function of the bandwidth of the interconnect used and the chosen processor. If a greater number of nodes is used, then there is not enough bandwidth to keep up with the extra computational performance and increased data transfer. If fewer nodes are used, then bandwidth will go unused as the system waits for calculations to complete. The arithmetic intensity figure used for each processor takes into account the relative speed of single- and double-precision floating point operations on that processor. This graph uses the arithmetic intensity of the algorithm to match a processor speed and interconnect bandwidth to the number of nodes used to make up the system.

Figure 14: Duchamp Memory Requirements. This graph shows how many nodes are needed, for a given amount of RAM per node, to store all the information needed in RAM. The x-axis is the number of nodes in the computer system, and the y-axis is the amount of RAM each node needs to store all the data required by Duchamp. This RAM is used to avoid the longer access times of secondary storage. This relation between the amount of RAM needed and the number of nodes, together with the maximum amount of RAM per node in available technology, forms a lower bound on the number of nodes that can be used to execute Duchamp.

5. Discussion

The black box arithmetic intensity is calculated first. A technology for the system bandwidth can be chosen, and from the black box arithmetic intensity the overall computational performance of the system can be determined. A choice of processor can then be made and compared to the overall computational performance to determine how many nodes are needed. The node model is then considered. Using the node arithmetic intensity from this model, the interconnect bandwidth can be determined from the number of nodes and the performance of the chosen processor. The node model also determines the amount of memory each node will need, from the number of nodes used. Therefore the information obtained from these models is used to estimate a potential combination of hardware to be used to execute the Duchamp program.

5.1. Black Box Model

The arithmetic intensity of Duchamp is first considered using the black box model. This is done to match the computational performance of the entire system with the bandwidth of the connection that is used to transfer data to and from the system. The black box arithmetic intensity increases with the size of the data cube, as shown in Figure 7. This test is done to show how the black box arithmetic intensity of the algorithm varies with different-sized data cubes. The arithmetic intensity increases because the number of operations required grows with both the size of the data cube and the number of scales, while the amount of data transfer required is only proportional to the size of the data cube. The jumps in the graph occur as the data cube becomes large enough that another scale is needed for the filtering. This suggests that a proportionally faster computer system, compared to the bandwidth, can be used as the size of the image increases.

For simplicity, this graph uses three assumptions to show the arithmetic intensity as a function of the data cube size. First, it assumes that the data cube has the same length in each of the three dimensions, so that the number of scales is only a function of the total number of elements in the cube, rather than of the smallest side length. The second assumption is that the number of values per element that need to be transferred is D = 5, and the third is that the number of iterations of the outermost loop is L = 4. Because of this, the arithmetic intensity shown here is only approximate. In particular, the arithmetic intensity of the HIPASS data cube is higher than that shown, because this graph overestimates how many values need to be transferred onto the system.

The actual arithmetic intensities of the HIPASS and ASKAP data cubes are shown in Figure 8. This graph shows that the black box arithmetic intensity of Duchamp using the HIPASS data cube is greater than that when using the ASKAP data cube. This is because the HIPASS cube requires less data transfer per element of the data cube. From this graph, a network technology can be chosen and the appropriate computational power of the system can be determined.

5.2. Node Model

We now consider Duchamp using the node model. In order to calculate the arithmetic intensity of the system, the number of operations and the amount of data transfer must be known. The

number of operations required, as calculated from the equations in Section 2.4, is given in the Results section. The amount of data transfer is more complex, and is shown in Figure 9. These graphs show how the amount of data transfer needed increases with the number of nodes. As the number of nodes increases, the amount of data transfer required approaches a linear increase with the number of nodes. There are sudden decreases in slope present in these graphs. These occur when the maximum data transfer is reached for a particular scale, so the data transfer for that scale stops increasing with the number of nodes. These changes can be seen more clearly in Figure 10. This plot shows the data transfer of Duchamp using the two data cubes, from using a single node for the entire data cube to using a single node for each element in the data cube. Note that these plots overestimate the amount of data transfer required, particularly for low numbers of nodes. This is because the model does not account for reflection of edge values, where a needed element lies outside the data cube and the value of a reflected element is used instead. The reflected element may lie in the original node, or overlap with elements needed from another node.

With these results, the arithmetic intensity of the Duchamp algorithm when using the HIPASS and ASKAP data cubes as input can be calculated. The arithmetic intensity of Duchamp using the HIPASS data cube is shown in Figure 11 and the arithmetic intensity using the ASKAP data cube is shown in Figure 12. These figures show how the arithmetic intensity decreases as the number of nodes increases. This is because the number of operations is constant with the number of nodes, but the data transfer needed increases. These figures each show three lines: the arithmetic intensities calculated using the single-precision floating point operations, the double-precision operations, and the equivalent combined operation counts.

Comparing this arithmetic intensity against available hardware can be used to determine the optimum number of nodes. Figure 13 shows the optimum number of nodes for a given combination of processor and interconnect bandwidth. There is a sudden jump in the optimum number of nodes near 2000 nodes. This is because a scale reaches its maximum data transfer, so the slope of the data transfer decreases and the slope of the arithmetic intensity of the Duchamp algorithm increases. The slope of the optimum number of nodes otherwise decreases, as the data transfer required increases.

The last factor in the node model is the amount of memory each node needs. Figure 14 shows how the amount of RAM each node needs varies with the number of nodes. Combined with the maximum amount of RAM per node in available technology, this relation forms a lower limit on the number of nodes that can be effectively used to run Duchamp. As a node needs five single-precision floating point values for each element of the data cube it holds, the amount of RAM required per node is proportional to the number of elements in the image, and inversely proportional to the number of nodes in the computer system. The graph decreases in steps for low numbers of nodes because only integer numbers of nodes are considered.

5.3. Hardware Choices

There are a number of constraints that affect the potential choice of hardware for running Duchamp on an ASKAP-size data cube.
The computer system should finish computation on the data cube in an equivalent amount of time to transferring the data cube onto the system.
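As a rough, hedged check on the figures quoted in the abstract, the sketch below strings the two models together for the ASKAP-sized cube: the black box intensity and a 100 Gigabit Ethernet link fix the total system performance, a choice of processor then fixes the node count, and the node count fixes both the RAM each node must hold and, via the node-model intensity (here including the median transfer of Equation 15), the interconnect bandwidth each node needs. It reuses the helper functions from the earlier sketches; the processor and network figures are those quoted in the text, and everything else is an illustrative assumption.

```python
import math

# ASKAP-sized cube and hardware figures quoted in the text
X, Y, F, D, S, L, w = 4096, 4096, 16384, 5, 10, 4, 2
V = X * Y * F

sp, dp = flop_counts(V, S, L)
p = equivalent_sp_flops(sp, dp, dp_cost=2)        # CPU / nvidia weighting

# Black box: a 100 Gigabit Ethernet link fixes the whole-system performance
a_blackbox = p / (4 * V * D)                      # FLOPs per byte moved onto the system
system_flops = a_blackbox * 12.5e9                # compute that keeps pace with the link

# Node count for a chosen processor (Intel Core i7 975, 213 GFLOP/s single precision)
nodes = math.ceil(system_flops / 213e9)

# RAM per node: five single-precision values per element held
ram_per_node = 5 * 4 * V / nodes

# Interconnect per node from the node-model intensity, including median transfer (Eq. 15)
r_node = 4 * (total_filter_transfer_elems(V, nodes, w, S, L) + V * (S + 2) * L)
interconnect = 213e9 / (p / r_node)

print(f"{nodes} nodes, {ram_per_node / 1e9:.1f} GB RAM per node, "
      f"{interconnect / 1e9:.2f} GB/s interconnect per node")
```

Swapping in the Tesla C2070 figure of 1.26 TFLOP/s in place of 213 GFLOP/s yields the smaller, GPU-based configuration in the same way.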


More information

GPU Computing with Fornax. Dr. Christopher Harris

GPU Computing with Fornax. Dr. Christopher Harris GPU Computing with Fornax Dr. Christopher Harris ivec@uwa CAASTRO GPU Training Workshop 8-9 October 2012 Introducing the Historical GPU Graphics Processing Unit (GPU) n : A specialised electronic circuit

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

Make sure that your exam is not missing any sheets, then write your full name and login ID on the front.

Make sure that your exam is not missing any sheets, then write your full name and login ID on the front. ETH login ID: (Please print in capital letters) Full name: 63-300: How to Write Fast Numerical Code ETH Computer Science, Spring 015 Midterm Exam Wednesday, April 15, 015 Instructions Make sure that your

More information

Wallace Hall Academy

Wallace Hall Academy Wallace Hall Academy CfE Higher Physics Unit 2 - Waves Notes Name 1 Waves Revision You will remember the following equations related to Waves from National 5. d = vt f = n/t v = f T=1/f They form an integral

More information

GPU acceleration of 3D forward and backward projection using separable footprints for X-ray CT image reconstruction

GPU acceleration of 3D forward and backward projection using separable footprints for X-ray CT image reconstruction GPU acceleration of 3D forward and backward projection using separable footprints for X-ray CT image reconstruction Meng Wu and Jeffrey A. Fessler EECS Department University of Michigan Fully 3D Image

More information

Chapter 38. Diffraction Patterns and Polarization

Chapter 38. Diffraction Patterns and Polarization Chapter 38 Diffraction Patterns and Polarization Diffraction Light of wavelength comparable to or larger than the width of a slit spreads out in all forward directions upon passing through the slit This

More information

The determination of the correct

The determination of the correct SPECIAL High-performance SECTION: H i gh-performance computing computing MARK NOBLE, Mines ParisTech PHILIPPE THIERRY, Intel CEDRIC TAILLANDIER, CGGVeritas (formerly Mines ParisTech) HENRI CALANDRA, Total

More information

Physics I : Oscillations and Waves Prof. S Bharadwaj Department of Physics & Meteorology Indian Institute of Technology, Kharagpur

Physics I : Oscillations and Waves Prof. S Bharadwaj Department of Physics & Meteorology Indian Institute of Technology, Kharagpur Physics I : Oscillations and Waves Prof. S Bharadwaj Department of Physics & Meteorology Indian Institute of Technology, Kharagpur Lecture - 20 Diffraction - I We have been discussing interference, the

More information

OSKAR-2: Simulating data from the SKA

OSKAR-2: Simulating data from the SKA OSKAR-2: Simulating data from the SKA AACal 2012, Amsterdam, 13 th July 2012 Fred Dulwich, Ben Mort, Stef Salvini 1 Overview OSKAR-2: Interferometer and beamforming simulator package. Intended for simulations

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Understanding Fraunhofer Diffraction

Understanding Fraunhofer Diffraction [ Assignment View ] [ Eðlisfræði 2, vor 2007 36. Diffraction Assignment is due at 2:00am on Wednesday, January 17, 2007 Credit for problems submitted late will decrease to 0% after the deadline has passed.

More information

TRANSFORMATIONAL TECHNOLOGIES

TRANSFORMATIONAL TECHNOLOGIES TRANSFORMATIONAL TECHNOLOGIES FOR THREE-DIMENSIONAL VISUALISATION (AND ANALYSIS) Christopher Fluke ESO 3D2014 CRICOS provider 00111D Thank you Key Collaborators: David Barnes Monash University e-research

More information

Chapter 15. Light Waves

Chapter 15. Light Waves Chapter 15 Light Waves Chapter 15 is finished, but is not in camera-ready format. All diagrams are missing, but here are some excerpts from the text with omissions indicated by... After 15.1, read 15.2

More information

Large Scale Data Visualization. CSC 7443: Scientific Information Visualization

Large Scale Data Visualization. CSC 7443: Scientific Information Visualization Large Scale Data Visualization Large Datasets Large datasets: D >> 10 M D D: Hundreds of gigabytes to terabytes and even petabytes M D : 1 to 4 GB of RAM Examples: Single large data set Time-varying data

More information

Introduction. Part I: Measuring the Wavelength of Light. Experiment 8: Wave Optics. Physics 11B

Introduction. Part I: Measuring the Wavelength of Light. Experiment 8: Wave Optics. Physics 11B Physics 11B Experiment 8: Wave Optics Introduction Equipment: In Part I you use a machinist rule, a laser, and a lab clamp on a stand to hold the laser at a grazing angle to the bench top. In Part II you

More information

Module 2: Computer Arithmetic

Module 2: Computer Arithmetic Module 2: Computer Arithmetic 1 B O O K : C O M P U T E R O R G A N I Z A T I O N A N D D E S I G N, 3 E D, D A V I D L. P A T T E R S O N A N D J O H N L. H A N N E S S Y, M O R G A N K A U F M A N N

More information

CUDA Experiences: Over-Optimization and Future HPC

CUDA Experiences: Over-Optimization and Future HPC CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign

More information

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please) Virginia Tech. Computer Science CS 5614 (Big) Data Management Systems Fall 2014, Prakash Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in

More information

Hyperspectral Unmixing on GPUs and Multi-Core Processors: A Comparison

Hyperspectral Unmixing on GPUs and Multi-Core Processors: A Comparison Hyperspectral Unmixing on GPUs and Multi-Core Processors: A Comparison Dept. of Mechanical and Environmental Informatics Kimura-Nakao lab. Uehara Daiki Today s outline 1. Self-introduction 2. Basics of

More information

CS 111: Digital Image Processing Fall 2016 Midterm Exam: Nov 23, Pledge: I neither received nor gave any help from or to anyone in this exam.

CS 111: Digital Image Processing Fall 2016 Midterm Exam: Nov 23, Pledge: I neither received nor gave any help from or to anyone in this exam. CS 111: Digital Image Processing Fall 2016 Midterm Exam: Nov 23, 2016 Time: 3:30pm-4:50pm Total Points: 80 points Name: Number: Pledge: I neither received nor gave any help from or to anyone in this exam.

More information

Tera-scale astronomical data analysis and visualization

Tera-scale astronomical data analysis and visualization MNRAS 429, 2442 2455 (2013) doi:10.1093/mnras/sts513 Tera-scale astronomical data analysis and visualization A. H. Hassan, 1 C. J. Fluke, 1 D. G. Barnes 2 andv.a.kilborn 1 1 Centre for Astrophysics and

More information

[1] IEEE , Standard for Floating-Point Arithmetic [and Floating-Point formats]

[1] IEEE , Standard for Floating-Point Arithmetic [and Floating-Point formats] MISB RP 1201 Recommended Practice Floating Point to Integer Mapping February 15 th 2012 1 Scope This recommended practice describes the method for mapping floating point values to integer values and the

More information

Assignment 6: Ray Tracing

Assignment 6: Ray Tracing Assignment 6: Ray Tracing Programming Lab Due: Monday, April 20 (midnight) 1 Introduction Throughout this semester you have written code that manipulated shapes and cameras to prepare a scene for rendering.

More information

TABLE OF CONTENTS PRODUCT DESCRIPTION VISUALIZATION OPTIONS MEASUREMENT OPTIONS SINGLE MEASUREMENT / TIME SERIES BEAM STABILITY POINTING STABILITY

TABLE OF CONTENTS PRODUCT DESCRIPTION VISUALIZATION OPTIONS MEASUREMENT OPTIONS SINGLE MEASUREMENT / TIME SERIES BEAM STABILITY POINTING STABILITY TABLE OF CONTENTS PRODUCT DESCRIPTION VISUALIZATION OPTIONS MEASUREMENT OPTIONS SINGLE MEASUREMENT / TIME SERIES BEAM STABILITY POINTING STABILITY BEAM QUALITY M 2 BEAM WIDTH METHODS SHORT VERSION OVERVIEW

More information

Fast Holographic Deconvolution

Fast Holographic Deconvolution Precision image-domain deconvolution for radio astronomy Ian Sullivan University of Washington 4/19/2013 Precision imaging Modern imaging algorithms grid visibility data using sophisticated beam models

More information

9/3/2015. Data Representation II. 2.4 Signed Integer Representation. 2.4 Signed Integer Representation

9/3/2015. Data Representation II. 2.4 Signed Integer Representation. 2.4 Signed Integer Representation Data Representation II CMSC 313 Sections 01, 02 The conversions we have so far presented have involved only unsigned numbers. To represent signed integers, computer systems allocate the high-order bit

More information

Universiteit Leiden Computer Science

Universiteit Leiden Computer Science Universiteit Leiden Computer Science Optimizing octree updates for visibility determination on dynamic scenes Name: Hans Wortel Student-no: 0607940 Date: 28/07/2011 1st supervisor: Dr. Michael Lew 2nd

More information

FFT-Based Astronomical Image Registration and Stacking using GPU

FFT-Based Astronomical Image Registration and Stacking using GPU M. Aurand 4.21.2010 EE552 FFT-Based Astronomical Image Registration and Stacking using GPU The productive imaging of faint astronomical targets mandates vanishingly low noise due to the small amount of

More information

Automated Control for Elastic Storage

Automated Control for Elastic Storage Automated Control for Elastic Storage Summarized by Matthew Jablonski George Mason University mjablons@gmu.edu October 26, 2015 Lim, H. C. and Babu, S. and Chase, J. S. (2010) Automated Control for Elastic

More information

Image Processing. Application area chosen because it has very good parallelism and interesting output.

Image Processing. Application area chosen because it has very good parallelism and interesting output. Chapter 11 Slide 517 Image Processing Application area chosen because it has very good parallelism and interesting output. Low-level Image Processing Operates directly on stored image to improve/enhance

More information

Smarter Balanced Vocabulary (from the SBAC test/item specifications)

Smarter Balanced Vocabulary (from the SBAC test/item specifications) Example: Smarter Balanced Vocabulary (from the SBAC test/item specifications) Notes: Most terms area used in multiple grade levels. You should look at your grade level and all of the previous grade levels.

More information

2D rendering takes a photo of the 2D scene with a virtual camera that selects an axis aligned rectangle from the scene. The photograph is placed into

2D rendering takes a photo of the 2D scene with a virtual camera that selects an axis aligned rectangle from the scene. The photograph is placed into 2D rendering takes a photo of the 2D scene with a virtual camera that selects an axis aligned rectangle from the scene. The photograph is placed into the viewport of the current application window. A pixel

More information

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong)

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) References: [1] http://homepages.inf.ed.ac.uk/rbf/hipr2/index.htm [2] http://www.cs.wisc.edu/~dyer/cs540/notes/vision.html

More information

Electromagnetic migration of marine CSEM data in areas with rough bathymetry Michael S. Zhdanov and Martin Čuma*, University of Utah

Electromagnetic migration of marine CSEM data in areas with rough bathymetry Michael S. Zhdanov and Martin Čuma*, University of Utah Electromagnetic migration of marine CSEM data in areas with rough bathymetry Michael S. Zhdanov and Martin Čuma*, University of Utah Summary In this paper we present a new approach to the interpretation

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Chapter 4. Clustering Core Atoms by Location

Chapter 4. Clustering Core Atoms by Location Chapter 4. Clustering Core Atoms by Location In this chapter, a process for sampling core atoms in space is developed, so that the analytic techniques in section 3C can be applied to local collections

More information

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters A New Online Clustering Approach for Data in Arbitrary Shaped Clusters Richard Hyde, Plamen Angelov Data Science Group, School of Computing and Communications Lancaster University Lancaster, LA1 4WA, UK

More information

7 th GRADE PLANNER Mathematics. Lesson Plan # QTR. 3 QTR. 1 QTR. 2 QTR 4. Objective

7 th GRADE PLANNER Mathematics. Lesson Plan # QTR. 3 QTR. 1 QTR. 2 QTR 4. Objective Standard : Number and Computation Benchmark : Number Sense M7-..K The student knows, explains, and uses equivalent representations for rational numbers and simple algebraic expressions including integers,

More information

MAT 003 Brian Killough s Instructor Notes Saint Leo University

MAT 003 Brian Killough s Instructor Notes Saint Leo University MAT 003 Brian Killough s Instructor Notes Saint Leo University Success in online courses requires self-motivation and discipline. It is anticipated that students will read the textbook and complete sample

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Theme 7 Group 2 Data mining technologies Catalogues crossmatching on distributed database and application on MWA absorption source finding

Theme 7 Group 2 Data mining technologies Catalogues crossmatching on distributed database and application on MWA absorption source finding Theme 7 Group 2 Data mining technologies Catalogues crossmatching on distributed database and application on MWA absorption source finding Crossmatching is a method to find corresponding objects in different

More information

Detecting Geometric Faults from Measured Data

Detecting Geometric Faults from Measured Data Detecting Geometric s from Measured Data A.L. Gower 1 1 School of Mathematics, Statistics and Applied Mathematics, National University of Ireland Galway, Ireland. May 4, 214 Abstract Manufactured artefacts

More information

The type of all data used in a C++ program must be specified

The type of all data used in a C++ program must be specified The type of all data used in a C++ program must be specified A data type is a description of the data being represented That is, a set of possible values and a set of operations on those values There are

More information

Integers & Absolute Value Properties of Addition Add Integers Subtract Integers. Add & Subtract Like Fractions Add & Subtract Unlike Fractions

Integers & Absolute Value Properties of Addition Add Integers Subtract Integers. Add & Subtract Like Fractions Add & Subtract Unlike Fractions Unit 1: Rational Numbers & Exponents M07.A-N & M08.A-N, M08.B-E Essential Questions Standards Content Skills Vocabulary What happens when you add, subtract, multiply and divide integers? What happens when

More information

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation Obviously, this is a very slow process and not suitable for dynamic scenes. To speed things up, we can use a laser that projects a vertical line of light onto the scene. This laser rotates around its vertical

More information

MET71 COMPUTER AIDED DESIGN

MET71 COMPUTER AIDED DESIGN UNIT - II BRESENHAM S ALGORITHM BRESENHAM S LINE ALGORITHM Bresenham s algorithm enables the selection of optimum raster locations to represent a straight line. In this algorithm either pixels along X

More information

Chapter 13 Strong Scaling

Chapter 13 Strong Scaling Chapter 13 Strong Scaling Part I. Preliminaries Part II. Tightly Coupled Multicore Chapter 6. Parallel Loops Chapter 7. Parallel Loop Schedules Chapter 8. Parallel Reduction Chapter 9. Reduction Variables

More information

Sun Lustre Storage System Simplifying and Accelerating Lustre Deployments

Sun Lustre Storage System Simplifying and Accelerating Lustre Deployments Sun Lustre Storage System Simplifying and Accelerating Lustre Deployments Torben Kling-Petersen, PhD Presenter s Name Principle Field Title andengineer Division HPC &Cloud LoB SunComputing Microsystems

More information

specular diffuse reflection.

specular diffuse reflection. Lesson 8 Light and Optics The Nature of Light Properties of Light: Reflection Refraction Interference Diffraction Polarization Dispersion and Prisms Total Internal Reflection Huygens s Principle The Nature

More information

Biomedical Image Analysis. Point, Edge and Line Detection

Biomedical Image Analysis. Point, Edge and Line Detection Biomedical Image Analysis Point, Edge and Line Detection Contents: Point and line detection Advanced edge detection: Canny Local/regional edge processing Global processing: Hough transform BMIA 15 V. Roth

More information

Single slit diffraction

Single slit diffraction Single slit diffraction Book page 364-367 Review double slit Core Assume paths of the two rays are parallel This is a good assumption if D >>> d PD = R 2 R 1 = dsin θ since sin θ = PD d Constructive interference

More information

(Refer Slide Time: 00:03:51)

(Refer Slide Time: 00:03:51) Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture 17 Scan Converting Lines, Circles and Ellipses Hello and welcome everybody

More information

School of Computer and Information Science

School of Computer and Information Science School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

WHOLE NUMBER AND DECIMAL OPERATIONS

WHOLE NUMBER AND DECIMAL OPERATIONS WHOLE NUMBER AND DECIMAL OPERATIONS Whole Number Place Value : 5,854,902 = Ten thousands thousands millions Hundred thousands Ten thousands Adding & Subtracting Decimals : Line up the decimals vertically.

More information

EE368 Project: Visual Code Marker Detection

EE368 Project: Visual Code Marker Detection EE368 Project: Visual Code Marker Detection Kahye Song Group Number: 42 Email: kahye@stanford.edu Abstract A visual marker detection algorithm has been implemented and tested with twelve training images.

More information

Creating an Automated Blood Vessel. Diameter Tracking Tool

Creating an Automated Blood Vessel. Diameter Tracking Tool Medical Biophysics 3970Z 6 Week Project: Creating an Automated Blood Vessel Diameter Tracking Tool Peter McLachlan - 250068036 April 2, 2013 Introduction In order to meet the demands of tissues the body

More information

Physics 202 Homework 9

Physics 202 Homework 9 Physics 202 Homework 9 May 29, 2013 1. A sheet that is made of plastic (n = 1.60) covers one slit of a double slit 488 nm (see Figure 1). When the double slit is illuminated by monochromatic light (wavelength

More information

Correlator Field-of-View Shaping

Correlator Field-of-View Shaping Correlator Field-of-View Shaping Colin Lonsdale Shep Doeleman Vincent Fish Divya Oberoi Lynn Matthews Roger Cappallo Dillon Foight MIT Haystack Observatory Context SKA specifications extremely challenging

More information

GG450 4/5/2010. Today s material comes from p and in the text book. Please read and understand all of this material!

GG450 4/5/2010. Today s material comes from p and in the text book. Please read and understand all of this material! GG450 April 6, 2010 Seismic Reflection I Today s material comes from p. 32-33 and 81-116 in the text book. Please read and understand all of this material! Back to seismic waves Last week we talked about

More information

Manycore and GPU Channelisers. Seth Hall High Performance Computing Lab, AUT

Manycore and GPU Channelisers. Seth Hall High Performance Computing Lab, AUT Manycore and GPU Channelisers Seth Hall High Performance Computing Lab, AUT GPU Accelerated Computing GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate

More information

Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations

Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Nikolai Zamarashkin and Dmitry Zheltkov INM RAS, Gubkina 8, Moscow, Russia {nikolai.zamarashkin,dmitry.zheltkov}@gmail.com

More information

Math 340 Fall 2014, Victor Matveev. Binary system, round-off errors, loss of significance, and double precision accuracy.

Math 340 Fall 2014, Victor Matveev. Binary system, round-off errors, loss of significance, and double precision accuracy. Math 340 Fall 2014, Victor Matveev Binary system, round-off errors, loss of significance, and double precision accuracy. 1. Bits and the binary number system A bit is one digit in a binary representation

More information

Powering Real-time Radio Astronomy Signal Processing with latest GPU architectures

Powering Real-time Radio Astronomy Signal Processing with latest GPU architectures Powering Real-time Radio Astronomy Signal Processing with latest GPU architectures Harshavardhan Reddy Suda NCRA, India Vinay Deshpande NVIDIA, India Bharat Kumar NVIDIA, India What signals we are processing?

More information

Matthew Schwartz Lecture 19: Diffraction and resolution

Matthew Schwartz Lecture 19: Diffraction and resolution Matthew Schwartz Lecture 19: Diffraction and resolution 1 Huygens principle Diffraction refers to what happens to a wave when it hits an obstacle. The key to understanding diffraction is a very simple

More information

PowerVault MD3 SSD Cache Overview

PowerVault MD3 SSD Cache Overview PowerVault MD3 SSD Cache Overview A Dell Technical White Paper Dell Storage Engineering October 2015 A Dell Technical White Paper TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS

More information

3.3 Optimizing Functions of Several Variables 3.4 Lagrange Multipliers

3.3 Optimizing Functions of Several Variables 3.4 Lagrange Multipliers 3.3 Optimizing Functions of Several Variables 3.4 Lagrange Multipliers Prof. Tesler Math 20C Fall 2018 Prof. Tesler 3.3 3.4 Optimization Math 20C / Fall 2018 1 / 56 Optimizing y = f (x) In Math 20A, we

More information

Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 14

Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 14 Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 14 Scan Converting Lines, Circles and Ellipses Hello everybody, welcome again

More information