Parallelization and Performance of 3D Ultrasound Imaging Beamforming Algorithms on Modern Clusters


F. Zhang, A. Bilas, A. Dhanantwari, K.N. Plataniotis, R. Abiprojo, and S. Stergiopoulos

Dept. of Electrical and Computer Engineering, 10 King's College Road, University of Toronto, Toronto, Ontario, M5S3G4, Canada
Defense R&D Canada, Toronto, 1133 Sheppard Ave. West, North York, Ontario, M3M3B9, Canada
{fanzhang, bilas}@eecg.toronto.edu, kostas@dsp.toronto.edu, {amar.adhanant, robert.abiprojo, stergios.stergiopoulos}@drdc-rddc.gc.ca

ABSTRACT
Recently there has been a lot of interest in improving the infrastructure used in medical applications. In particular, there is renewed interest in non-invasive, high-resolution diagnostic methods. One such method is digital, 3D ultrasound medical imaging. Current state-of-the-art ultrasound systems use specialized hardware for performing advanced processing of input data to improve the quality of the generated images. Such systems are limited in their capabilities by the underlying computing architecture and they tend to be expensive due to the specialized nature of the solutions they employ. Our goal in this work is twofold: (i) to understand the behavior of this class of emerging medical applications in order to provide an efficient parallel implementation and (ii) to introduce a new benchmark for parallel computer architectures from a novel and important class of applications. We address the limitations faced by modern ultrasound systems by investigating how all processing required by advanced beamforming algorithms can be performed on modern clusters of high-end PCs connected with low-latency, high-bandwidth system area networks. We investigate the computational characteristics of a state-of-the-art algorithm and demonstrate that today's commodity architectures are capable of providing almost-real-time performance without compromising image quality significantly.
Keywords
Parallel processing, Medical applications

Categories and Subject Descriptors
C.3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS'02, June 22-26, 2002, New York, New York, USA. Copyright 2002 ACM /02/0006...$5.00.

General Terms
Algorithms, Performance

1. INTRODUCTION
Major efforts have been devoted recently to improving non-invasive, high-precision diagnostic methods, a critical component in the renewed effort to enhance health services. One of the research directions taken to address such requirements is the development of high-resolution, digital, three-dimensional (3D) ultrasound medical imaging systems. With the advent of high performance computing facilities and the availability of transducer crystal technology, ultrasound imaging systems have emerged as efficient methods to extract and reproduce relevant medical diagnostic information.

[Figure 1: The components of a beamforming-based ultrasound system. Blocks shown: Scanned 3D Volume, Ultrasound Probing, Input Time Series, Beamformer, Reconstructed 3D Volume.]

Generally, a digital ultrasound system consists of a set of sensors [25, 11, 2] that perform data acquisition and a backend computing architecture, responsible for processing the raw data and reconstructing the ultrasonic images [24]. Figure 1 depicts a typical ultrasound system. The ultrasound probing apparatus consists of a set of sensors and a data acquisition unit that probe the object under consideration and gather the sampled data. The beamformer is based on some computing structure and performs signal processing of the samples in order to reconstruct the image of the scanned object.
In this framework, both the probing unit and the computing architecture are important in delivering good quality diagnostic results. Current limitations in sensor technologies necessitate the use of complex signal processing engines to improve image quality. Most ultrasound medical systems suffer from poor image resolution. Some of these limitations can be attributed to fundamental physical aspects of the ultrasound transducer and the interaction with the tissue. Advanced signal processing algorithms can enhance image resolution and detection quality as well as minimize the relevant probing hardware requirements, leading to cost-effective ultrasound system technologies.

However, the current state of the art in high-resolution, digital, 3D ultrasound medical imaging faces two main challenges. First, the ultrasound signal processing structures used are computationally demanding. Traditionally, specialized computing architectures and hardware have been used to provide the levels of performance and I/O throughput required, resulting in high system design and ownership costs. With the emergence of high-end workstations and low-latency, high-bandwidth interconnects [1, 3, 6], it now becomes interesting and timely to investigate whether such technologies can be used in building low-cost, high-resolution, 3D ultrasound medical imaging systems. Second, although beamforming algorithms have been studied in the context of other applications [4], little is known about their computational characteristics with respect to ultrasound-related processing, and medical applications in general. It is not clear which parts of these algorithms are the most demanding in terms of processing or communication and how exactly they can be mapped onto modern parallel architectures. In particular, although the algorithmic complexity of different sections can be calculated, little has been done in terms of actual performance analysis on real systems. The lack of such knowledge inhibits further progress in this area, since it is not clear how these algorithms should evolve to lead to applicable solutions in the area of ultrasound medical imaging.

In this work we address both of these issues by designing an efficient parallel beamforming algorithm and studying its behavior and requirements on a generic computing architecture that consists of commodity components. First, we review the signal processing algorithm used in the implementation of a 3D ultrasound medical imaging system we are currently building. We provide an efficient, all-software, sequential implementation that shows considerable advantages over hardware-based implementations of the past.
We then provide an efficient parallel implementation of the algorithm for a cluster of high-end PCs connected with a low-latency, high-bandwidth interconnection network and study its behavior. The emphasis is on the computational characteristics of the algorithm and the identification of parameters that critically affect both the performance and cost of our system. We study the behavior and performance of the algorithm for a wide set of parameters and we reveal a number of interesting characteristics leading to conclusions about the prospect of using commodity architectures for performing all related processing in this family of medical applications.

Our high-level conclusions and contributions are:
(i) A 16-processor system today can achieve close-to-real-time performance for high image quality and is certainly expected to do so in the near future.
(ii) Only small parts of this family of signal processing algorithms are very computationally intensive. In particular, 85-98% of the time is spent in the FFT and beam steering functions across all our runs, and most runs spend between 92% and 95% of their time in these functions.
(iii) The communication requirements in the particular implementation are fairly small, localized, and certainly within the capabilities of modern low-latency, high-bandwidth interconnects.
(iv) Our results provide an indication of the amount of processing required for a given level of image quality and can be used as a reference in designing computing architectures for ultrasound systems.

The rest of the paper is organized as follows: Section 2 provides background on ultrasound systems. Section 3 presents a comprehensive summary of the family of conventional beamforming algorithms we use. Section 4 describes the platform we use for our experiments. Section 5 describes our sequential and parallel implementations of the algorithm. Section 6 describes our methodology and Section 7 presents our experimental results.
Finally, we present related work in Section 8 and draw conclusions in Section 9.

2. ULTRASOUND SYSTEMS
Ultrasound medical imaging is one of the most widely used imaging modalities in the area of health services. Ultrasound systems can be used for early diagnosis, screening, monitoring, and minimally-invasive follow-up procedures. Ultrasound image quality has improved dramatically over the last few years, mostly due to the complete elimination of analogue electronics and the introduction of digital beamforming techniques [24]. Although there is a large base of installed systems and numerous hardware platforms already in use, the majority of these systems share common characteristics. In the near future, practically all ultrasound systems will utilize signal processing techniques to process signals received as a result of the stimulation of the tissue. Such ultrasound systems follow the general structure shown in Figure 1. The system's quality is determined both by the physical characteristics of the system and by the signal processing algorithm used to process the signals.

The transducers used in such ultrasound systems are based on phased array transceiver technology. They consist of an array of transceivers which can be aligned in a special geometric configuration such as linear, circular, planar, cylindrical, or spherical. The purpose of the phased array transducer is to exploit the superposition of waves radiated by the individual transceivers of the array. The ability to control the phase and the amplitude of the ultrasound waves emitted by each individual transceiver allows the angular steering of the radiated beams that are used to illuminate a volume of interest. After transmitting the sound waves, the ultrasound system switches to receiving mode. As the sound waves penetrate the volume and encounter objects, reflection occurs. The reflected waves as well as their multi-path versions are received and digitized by the ultrasound machine.
A certain segment of the digitized signals is processed by the beamformer, resulting in discrete time series of a certain length. A digital beamformer is a spatial filter that processes data from the array of sensors in order to enhance the signal received from a certain direction, reducing the interference of the background noise. Beams are then combined together to form the spatial images of 2D or 3D volumes. Next, we describe in more detail the specific beamforming algorithm we use in our work.

3. BEAMFORMING ALGORITHM
The beamforming algorithm we use in our work is based on the conventional beamforming algorithm [24]. 3D planar-phased-array beamformers use multiple beams to scan a 3D volume. The volume is reconstructed by using the transceivers' outputs of the planar array transducer. Each beam is characterized by its angular direction in the

scanned volume. Figure 2 shows how each beam is specified. The thick arrow depicts a beam in the direction (θ, φ) in the spherical coordinate system. The center of the coordinate system is the center of the (N × M) planar array that lies symmetrically on the (X, Y) plane, where N and M denote the number of sensors in each row and column of the array, respectively. The rows and columns of the array are aligned in parallel with the X and Y axes. With each pair of angles (θ, φ) we associate another pair of angles (A, B) that are used in the algorithm to characterize a beam. A is the angle between the beam and the X axis and B is the angle between the beam and the Y axis. The boundaries of the volume reconstructed by each beam are specified in terms of these angles A and B (Figure 2). For example, a reconstructed volume is specified to be within 70° ≤ A, B ≤ 110°. The number of beams is specified as a × b, where a and b are the numbers of beams in the A and B angular directions. The beam width is defined to be the angular width that is covered by a single beam and characterizes the image resolution capabilities of the image reconstruction process. The more beams and the narrower the beam width the beamformer uses, the better the quality of the images it can generate, albeit at higher computational cost.

[Figure 2: The representation of a beam in the spherical coordinate system. The figure labels the origin O, the Z axis, the angles θ, φ, A, and B, and the focal depth R.]

To produce sharp images of the input object, the scanned volume is divided into multiple focal zones. Ultrasound systems use various focal depths for the reconstructed volume. Each focal zone $Z_R$ is centered around a focal depth R. For example, the zone of depth between 1 cm and 2 cm is reconstructed using a focal depth R = 1.5 cm. To produce an image that is focused over the whole reconstructed region, the beamforming process is repeated for each focal zone (characterized by a different value of R). Narrower focal zones produce sharper images but result in more processing as well. In most practical applications, the focal zone size should be in the range between 0.5 and 1.0 cm.
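The relationship between zone size, zone count, and focal depth can be sketched in a few lines of C. This is only an illustration under stated assumptions: zones are taken to be of equal size and to start at depth zero, and the function names `focal_zone_count` and `focal_zone_center` are hypothetical, not taken from the authors' code.

```c
/* Number of equal-size focal zones needed to cover depths [0, max_depth_cm].
 * A partial last zone counts as a full zone. Returns 0 on invalid input.
 * (Hypothetical helper: equal-size zones starting at depth 0 are assumed.) */
int focal_zone_count(double max_depth_cm, double zone_size_cm)
{
    if (max_depth_cm <= 0.0 || zone_size_cm <= 0.0)
        return 0;
    int n = (int)(max_depth_cm / zone_size_cm);
    /* Add one zone if the truncated count leaves part of the depth uncovered. */
    if (n * zone_size_cm < max_depth_cm)
        n++;
    return n;
}

/* Focal depth R (in cm) at the center of zone k, for k = 0 .. count-1.
 * Zone k is assumed to span [k * zone_size_cm, (k + 1) * zone_size_cm]. */
double focal_zone_center(int k, double zone_size_cm)
{
    return (k + 0.5) * zone_size_cm;
}
```

With a zone size of 1.0 cm, zone index 1 spans 1 cm to 2 cm and its center is 1.5 cm, which matches the example in the text. Halving the zone size doubles the number of beamforming passes per frame.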
In our algorithm we use uniform spacing for R. The beam time series at the output of the beamformer for a specific R is truncated and only the segment that covers the focal zone $Z_R$ under consideration is processed. The images of the different focal zones $Z_R$ are then concatenated to form the whole volume.

The volume is reconstructed from the time samples of each beam and focal zone as follows. Since the received acoustic wave is coming from a point source located at a finite distance from the array, the wavefront is radial. Therefore, the arriving waves are not simple plane waves but spherical waves reflected by an object and, due to the separation of the transceivers in the array, they arrive at different transceivers with slight time delays. If we assume that the distance of the m-th transceiver on the X axis from the origin point O is $x_m$ and the distance of the n-th transceiver on the Y axis from the same origin point is $y_n$, then we can compute the time delays of these two transceivers, relative to the origin, as:

$$t_x = \frac{\sqrt{R^2 + x_m^2 - 2 R x_m \cos A} - R}{c} \tag{1}$$

and

$$t_y = \frac{\sqrt{R^2 + y_n^2 - 2 R y_n \cos B} - R}{c} \tag{2}$$

where c is the speed of sound in human tissue.
Thus, for a beam with focal depth R, the 3D angular response of an N by M planar array to a steered direction $(A_x, B_y)$ can be expressed as:

$$B(f_i, A_x, B_y, R) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I_{m,n}(f_i)\, S_{m,n}(f_i, A_x, B_y, R) \tag{3}$$

where $I_{m,n}(f_i)$ is the Fourier transform of the input time series from the (m, n) transceiver:

$$I_{m,n}(f_i) = \mathrm{FFT}\big(i_{m,n}(t_i)\big) \tag{4}$$

and $S_{m,n}(f_i, A_x, B_y, R)$ is the steering vector applied to compensate for the time delay of the (m, n) transceiver with respect to the reference point (located at the center of the planar array):

$$S_{m,n}(f_i, A_x, B_y, R) = \exp\!\left( j 2\pi f_i \left[ \frac{\sqrt{R^2 + x_m^2 - 2 R x_m \cos A_x} - R}{c} + \frac{\sqrt{R^2 + y_n^2 - 2 R y_n \cos B_y} - R}{c} \right] \right) \tag{5}$$

The equation for the angular response (3) can be simplified by separating the terms in the steering vector $S_{m,n}$ as follows:

$$B(f_i, A_x, B_y, R) = \sum_{n=0}^{N-1} S_n(f_i, B_y, R) \left[ \sum_{m=0}^{M-1} I_{m,n}(f_i)\, S_m(f_i, A_x, R) \right] \tag{6}$$

where

$$S_m(f_i, A_x, R) = \exp\!\left( j 2\pi f_i\, \frac{\sqrt{R^2 + x_m^2 - 2 R x_m \cos A_x} - R}{c} \right) \tag{7}$$

$$S_n(f_i, B_y, R) = \exp\!\left( j 2\pi f_i\, \frac{\sqrt{R^2 + y_n^2 - 2 R y_n \cos B_y} - R}{c} \right) \tag{8}$$

The summation term in square brackets in equation (6) is equivalent to the response of a line array beamformer along the X axis. If we let all the steered beams from this summation term form a vector denoted by $B_n(f_i, A_x)$, then equation (6) can be rewritten as:

$$B(f_i, A_x, B_y, R) = \sum_{n=0}^{N-1} B_n(f_i, A_x)\, S_n(f_i, B_y, R), \tag{9}$$

which expresses a linear beamforming along the Y axis with $B_n(f_i, A_x)$ as input.
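The equivalence between the direct double sum of equation (3) and the two-step form of equations (6) and (9) can be checked numerically for a single frequency bin. The C sketch below assumes the steering term factors as $S_{m,n} = S_m S_n$, which follows from equations (5), (7), and (8). The function names, the row-major layout of `in`, and the use of double precision are illustrative assumptions, not the authors' implementation.

```c
#include <complex.h>

/* Direct evaluation of equation (3) for one frequency bin, using the
 * factorization S_{m,n} = S_m * S_n. The array `in` holds I_{m,n} in
 * row-major order: in[m * n_count + n]. */
double complex beam_direct(const double complex *in,
                           const double complex *sm, int m_count,
                           const double complex *sn, int n_count)
{
    double complex acc = 0.0;
    for (int m = 0; m < m_count; m++)
        for (int n = 0; n < n_count; n++)
            acc += in[m * n_count + n] * sm[m] * sn[n];
    return acc;
}

/* Two-step evaluation of equations (6) and (9): one line-array beamformer
 * along X for each n, followed by a single line-array beamformer along Y. */
double complex beam_two_step(const double complex *in,
                             const double complex *sm, int m_count,
                             const double complex *sn, int n_count)
{
    double complex acc = 0.0;
    for (int n = 0; n < n_count; n++) {
        double complex bn = 0.0;            /* B_n(f_i, A_x) in equation (9) */
        for (int m = 0; m < m_count; m++)
            bn += in[m * n_count + n] * sm[m];
        acc += bn * sn[n];
    }
    return acc;
}
```

Both functions return the same value up to rounding. For a single beam the two-step form mainly saves multiplications; the larger saving in a full scan would come from reusing each $B_n(f_i, A_x)$ across all $B_y$ directions, which this single-beam sketch does not show.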

Equation (9) suggests that the 2D planar array beamformer can be decomposed into two linear array beamforming steps. The first step is a line array beamforming along the X axis, which is repeated N times to obtain the vector $B_n(f_i, A_x)$. The second step is a line array beamforming along the Y axis, which is performed only once, treating the vector $B_n(f_i, A_x)$ as the input signal for the line array beamformer to obtain the output $B(f_i, A_x, B_y, R)$. The decomposition of the planar array beamformer into these two line array beamforming steps leads to an efficient implementation based on the following two factors [24]: First, the number of transceivers involved in each of these line array beamformers is much smaller than the total number of transceivers, M × N, in the planar array. This kind of decomposition for the 3D beamformer reduces both memory and CPU requirements. Second, all line array beamformers can be executed in parallel, resulting in a high degree of coarse-grain parallelism.

Finally, we should note that the number of sensors used in a transducer array is an important parameter for an ultrasound system. Detection of an acoustic signal in a noise field is characterized by the array gain (AG) parameter that is usually defined as:

AG = 10 log(M × N × BIN) 2

where M × N is the number of sensors and BIN is the number of frequency bins used in the beamforming (or the FFT size, as explained later). The more sensors used in a sensor array, the higher the array gain. The array gain indicates the strength of a beamformer in detecting reflected ultrasound signals. When an object is viewed as a collection of reflective point sources, a beamformer with higher array gain can produce sharper images of the individual point sources.

4. EXPERIMENTAL PLATFORM
Our final ultrasound system follows the overall structure shown in Figure 1. The computing architecture we will use is a modern cluster of high-end PCs.
Each node will be equipped with a PCI data acquisition card that will connect the node to a subset of the sensor array. The data acquisition cards will deliver the probing data from the sensor array to the corresponding node's memory. The beamforming algorithm will then reconstruct the image of the scanned object, redistributing data as appropriate. The purpose of this work is to examine the processing component of the system, after the sampled data have been placed in the main memory of each node.

The experimental system we use for evaluation is a cluster of 16 2-way Pentium III nodes interconnected with a Myrinet network [3]. The exact system configuration is summarized in Table 1. Myrinet is a low-latency, high-bandwidth, point-to-point system area network (SAN), used widely for clusters of workstations and PCs. By allowing users to directly access the network, without operating system intervention, Myrinet and other SANs dramatically reduce latencies compared to traditional TCP/IP-based local area networks. Moreover, to further reduce latencies in SANs, direct memory operations are usually supported; reads and writes to remote memory are performed without remote processor intervention. Each network interface in our system has a 133 MHz programmable processor (LANai9) and connects the node to the network with two unidirectional links of 160 MByte/s peak bandwidth each. Actual node-to-network bandwidth is usually constrained by the 133 MBytes/s I/O bus on which the NIC sits. All system nodes are connected with a 16-port full crossbar Myrinet switch.

Table 1: Cluster node configuration.
    Processors               2 x Intel Pentium III, 800 MHz
    Cache                    32K (L1), 512K (L2)
    Memory                   512MB SDRAM
    OS                       RedHat Linux Kernel smp
    PCI buses                32 bits, 33 MHz
    NIC                      Myricom M3M-PCI64B
    Communication library    MPICH/SCore 4.0

The communication layer we use is the Message Passing Interface (MPI) on top of the SCore system [14]. SCore is a high-performance parallel programming environment for workstation and PC clusters. SCore relies on the PMv2 [26] low-level communication layer.
The MPI implementation we use is a port of the MPICH library [19] for the SCore system. Figure 3 shows the bandwidth and latency of the basic, un-contended MPI_Send and MPI_Recv operations. We obtain these point-to-point numbers by running the SKaMPI (Special Karlsruher MPI-Benchmark) benchmark on our system [23]. In all our experiments we use the gcc compiler, version , with the -O2 optimization level.

5. ALGORITHM IMPLEMENTATION
Our implementations of the algorithm outlined in Section 3 assume that sampled data has already been placed in the main memory of each node by the acquisition units. Next, we present our in-house sequential and parallel implementations of the 3D beamforming algorithm.

    void main() {
        create_filter(bf_filter);
        create_steeringvector(STV);
        // for each tile (pr, pc)
        for (int pr = 0; pr < ROW; pr++) {
            for (int pc = 0; pc < COL; pc++) {
                read_data(buffer_in[CHC][CHR][NUM_FREQ]);
                while (zone < NZONES) {
                    FFT(buffer_in, fft_out);
                    Filter(fft_out, bf_filter);
                    while (fft samples >= zones samples) {
                        for (xb = 0; xb < xbeams; xb++) {
                            C_Steer(fft_out, STV, az_out);
                            for (yb = 0; yb < ybeams; yb++) {
                                R_Steer(az_out, STV, bout);
                                IFFT(bout);
                                Write_to(buffer_out, bout);
                            }
                        }
                    }
                }
            }
        }
        display(buffer_out);
    }

Algorithm 1: Pseudo-code for the sequential implementation.

5.1 Sequential Implementation
Our sequential algorithm for performing the computation

outlined in Section 3 consists of the following phases: read input samples, compute FFTs, filter results, perform column steering, reorganize data in memory, perform row steering, perform inverse FFTs, and, finally, output to display. Algorithm 1 shows the pseudo-code for this implementation.

[Figure 3: Ping-pong bandwidth in MB/s (left) and one-way latency in microseconds (right), plotted against message size in bytes, for a pair of nodes using (MPI_Send, MPI_Recv) to send and receive data.]

In addition to dividing the scanned volume into multiple focal zones and using multiple beams to scan it, beamforming algorithms divide the 2D surface to be scanned into multiple tiles. If we view the 2D surface as an array of points, we can divide the rows and columns into ROW and COL groups, forming ROW × COL tiles. Each tile is scanned by the full planar sensor array. Thus, CHC and CHR represent the number of sensors in each dimension of the planar array. Every full snapshot (that generates a single full 3D frame of the scanned volume) requires scanning all tiles and focal zones. For real-time processing we would require at least 10 frames (full snapshots) per second and ideally 20 to 30.

To re-create the depth information the algorithm processes the data based on a number of focal zones (NZONES). Each focal zone is of constant width (depth), which depends on the depth of the volume to be scanned and the number of focal zones. The volume depth is usually constant and defined by the type of objects the ultrasound will scan. For instance, different human organs require different scan depths. In this work we use a fixed maximum focal depth of 16 cm and we vary only the number of focal zones.

For each scanned point of the input volume the program reads the time series data from the corresponding sensor (that may scan multiple points) and stores it in a buffer buffer_in[CHC][CHR][NUM_FREQ] in host memory.
Each sample is 32 bits and is represented as a single-precision floating point number. The number of samples that need to be processed is dictated by the depth of each focal zone. The probing signal used to scan the object reaches different depths of the scanned volume with different delays. The sampling rate used to digitize the received signal dictates the minimum number of samples (and the minimum FFT size) required to reconstruct depth information. For example, assuming a sampling frequency of $F_s$ = 30 MHz, the number of samples needed to reconstruct 20 cm depth can be computed by

$$N = \frac{2 d F_s}{c} = \frac{2 \times 0.2 \times 30 \times 10^{6}}{1540} \approx 7792,$$

where N is the number of read-in samples, c is the speed of sound in meters per second, and d is the depth of the reconstructed area in meters. The factor of two is needed to account for the round trip time. Thus, in this example the ultrasound system (acquisition unit) needs to provide the beamformer with 7792 samples for each of the sensor time series of the ultrasound probe. Using more than the minimum number of samples can improve the array gain and result in better quality imaging. However, reading in more samples results not only in more processing but also in longer acquisition times and higher storage requirements. In our experiments we set the number of samples to 8K and instead vary the size of the FFT operations.

After the time samples are read and converted to the frequency domain, a filtering phase is used to reduce the amount of information passed to later stages. The information embedded in the received signals that is necessary for reconstructing a particular depth region is localized in a certain bandwidth. Using only the relevant frequency components further reduces computation time. Thus, the FFT output samples are filtered (with a Finite Impulse Response (FIR) filter [22]) to exclude unnecessary information and the related processing. The bandwidth depends on the focal depth and the center frequency of the ultrasound pulses.
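The sample-count arithmetic earlier in this section can be checked with a short C sketch. The function names are hypothetical, the speed of sound of 1540 m/s is the value implied by the worked example, and rounding up to a power of two is only a common FFT-size convention assumed here, not something the text prescribes.

```c
#include <stddef.h>

/* Samples needed to cover a round trip to depth_m: N = 2 d Fs / c.
 * The result is truncated to an integer, which matches the worked figure
 * in the text. Hypothetical helper name. */
long min_samples(double depth_m, double fs_hz, double c_m_per_s)
{
    return (long)(2.0 * depth_m * fs_hz / c_m_per_s);
}

/* Smallest power of two that is >= n, for n >= 1. Assumed convention for
 * choosing a buffer or FFT length; not stated in the text. */
long next_pow2(long n)
{
    long p = 1;
    while (p < n)
        p *= 2;
    return p;
}

/* Bytes occupied by a chc x chr x samples input buffer of 32-bit floats,
 * mirroring the buffer_in[CHC][CHR][NUM_FREQ] layout described above. */
size_t input_buffer_bytes(size_t chc, size_t chr, size_t samples)
{
    return chc * chr * samples * sizeof(float);
}
```

For a depth of 0.2 m at 30 MHz the first function gives 7792 samples, and the next power of two is 8192, the 8K sample count used in the experiments. Under the additional assumption of a 32 × 32 sensor array, an 8K-sample input buffer would occupy 32 MB per tile.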
Lower frequency signals usually have better penetration into deeper regions, whereas higher frequency signals produce sharper beam resolutions. However, center frequencies are usually fixed for each depth. Thus, in this work we use 2 MHz as the center frequency (for an input volume with maximum depth of 16 cm). In the applications we are interested in, most objects (human organs) would fall within this range. Given this center frequency, the bandwidth of the filter can vary in the range 0.5 to 4.0 MHz.

After filtering, the beamformer performs the steering operations and finally samples are converted back to the time domain for displaying. Based on equation (6), the xb and yb loops process the steering of beams on the X and Y axes separately, using the pre-calculated steering vector STV to align the time delay of the signals arriving at different sensors, and IFFT transforms the signal from the frequency to the time domain.

5.2 Optimizations
To gain confidence that we start from an efficient sequential implementation, before proceeding with parallelization, we perform a number of measurements to fine-tune several aspects of our sequential implementation. First, we explore various FFT implementations, both our own and publicly available. We find that, for the processors we use, the most efficient implementation is FFTW [8], a C

library for computing discrete Fourier transforms. Since the input time series data are real numbers we use the real one-dimensional FFT function rfftw(). This also minimizes space requirements, since the output of this function is a half-complex array that consists of only half the DFT amplitudes; the negative-frequency amplitudes for real data are the complex conjugates of the positive-frequency amplitudes. The side effect of this is that we need to reorganize the output into a common, full-complex array format after the FFT and revert to the half-complex array before the IFFT. Also, FFTW computes an un-normalized transform of the input signal (IFFT(FFT(x)) = N·x for size-N transforms). Thus, a division by N is needed for each element of the array after the final IFFT. The plan argument to rfftw() is constant across invocations and can be precomputed.

    void main(int argc, char** argv) {
        create_filter(bf_filter);
        create_steering_vector(STV);
        for (each frame tile)
            read_data(buffer_in);
            for (each fft-size samples)
                FFT(buffer_in, fft_out);
                Filter(fft_out, bf_filter);
                for (each focal zone)
                    for (each processor proc < NPROCS)
                        for (each x-axis beam < xbeams/NPROCS)
                            C_Steer(fft_out, STV, az_out_send);
                        // redistribute data among nodes
                        if (proc != My_rank) {
                            MPI_Irecv(az_out_buf[proc], sendsize, MPI_FLOAT, MPI_ANY_SOURCE,
                                      Tag, MPI_COMM_WORLD, &recv_req[comm_count]);
                            MPI_Send(az_out_send, sendsize, MPI_FLOAT, proc, Tag, MPI_COMM_WORLD);
                        } else {
                            memcpy(az_out_buf[proc], az_out_send, sendsize*sizeof(float));
                        }
                    MPI_Waitall(NPROCS-1, recv_req, recv_stat);
                    datareordertransformation(az_out_buf, az_out);
                    for (each x-axis beam < xbeams/NPROCS)
                        for (each y-axis beam < ybeams)
                            R_Steer(az_out, STV, bout);
                            IFFT(bout);
                            write_time_serial_data(buf_out, bout);
        create_display_data(buf_out);
    }

Algorithm 2: Pseudo-code for the parallel implementation.

Second, we experiment with multiple ways of performing the steering and the related dot product operations.
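Returning to the half-complex output format of the real FFT described earlier in this section, the reorganization into a full-complex array and the normalization after the inverse transform can be sketched as follows. The layout assumed is the FFTW 2 half-complex convention, in which element k holds the real part of bin k for k = 0 .. n/2 and element n-k holds the imaginary part of bin k for 0 < k < n/2. The function names are hypothetical, and no FFTW call is made here.

```c
#include <complex.h>

/* Expand a half-complex array hc of length n (assumed FFTW 2 layout:
 * r0, r1, ..., r_{n/2}, i_{(n+1)/2-1}, ..., i_1) into a full-complex
 * spectrum of length n, filling negative frequencies by conjugation. */
void halfcomplex_to_full(const float *hc, int n, float complex *full)
{
    if (n <= 0)
        return;
    full[0] = hc[0];                          /* DC bin is purely real */
    for (int k = 1; k < (n + 1) / 2; k++) {
        full[k]     = hc[k] + hc[n - k] * I;  /* positive-frequency bin */
        full[n - k] = hc[k] - hc[n - k] * I;  /* its complex conjugate */
    }
    if (n % 2 == 0)
        full[n / 2] = hc[n / 2];              /* Nyquist bin is purely real */
}

/* Undo the factor of n left by an un-normalized forward + inverse pair. */
void normalize_after_ifft(float *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] /= (float)n;
}
```

For the real input {1, 2, 3, 4} the DFT is {10, -2+2j, -2, -2-2j}, whose half-complex form is {10, -2, -2, 2}; expanding it with the function above recovers the four complex bins.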
We notice that the summations in equation (6) are sums of M and N complex numbers, respectively, each of which is the result of a complex multiplication such as $I_{m,n}(f_i)\, S_m(f_i, A_x, R)$; that is, they are complex dot products. We find that the best results are obtained by using the cblas_cdotu_sub() function from the Intel Math Kernel Library (MKL) [12] to compute the necessary dot products and to perform the steering. MKL is optimized for the Pentium family of processors and makes effective use of the Matrix Manipulation Extensions (MMX) [21], SSE (Streaming SIMD Extensions) [13], and similar instructions. Third, we tune loop ordering and the layout of multidimensional array data structures to improve memory access behavior and to reduce cache misses.

The overall effect of these optimization steps is a reduction of the overall execution time of the sequential implementation by a factor of about 10. It is a surprising result that hand-tuning can be so effective with all compiler optimizations turned on. However, since in this work we are more interested in the behavior of the parallel version, we omit these results due to space limitations.

5.3 Parallel Implementation
The parallel version of the beamforming algorithm follows closely the structure of the sequential implementation. We see that the data read from each sensor is processed independently until steering. Then, during the steering phase, the beams across the X and Y directions are independent. Therefore, we choose to divide the computation into two phases. The first phase includes all processing up to and including the column-steering phase. The second phase includes the rest of the processing, starting at the row-steering phase. Between the two phases, we need to reorganize the data in memory by performing a matrix transpose, which results in an all-to-all communication pattern. The first phase of the computation for each frame is decomposed into tasks based on the data generated by each sensor.
Thus, there are as many tasks as sensors (e.g. ), which is sufficient for systems with large numbers of processors. The tasks for the second phase are determined by the processing associated with each beam. We use the processing related to a single beam as the basic task and we decompose the second phase into xbeams × ybeams tasks. For instance, with 8 beams in each direction, there are 64 coarse-grain tasks. We expect that for all practical applications at least 8 × 8 beams will be necessary, and thus we do not consider cases where the number of processors is larger than the total number of beams.

We experiment with two implementations of the parallel algorithm. The first implementation uses dedicated nodes for each phase. However, we find that balancing the number of nodes between the two phases of the computation depends on a large number of parameters. Thus, we provide a second, symmetric implementation as well, where all nodes in the system perform the same type of processing. Although the first, dataflow approach has certain advantages in reducing task management costs, we find that the second, SPMD approach is more flexible and results in better load balancing. Thus, for the rest of this work we only use our symmetric implementation, as shown in Algorithm 2.

6. EVALUATION METHODOLOGY
The goal of our work is twofold. We are interested in evaluating the absolute performance of this family of algorithms on clusters of generic, commodity components. In addition, we aim to understand the computational characteristics of this emerging class of applications. Since the data acquisition unit of our system is not available yet and there are no publicly available data from actual systems (due to privacy and other constraints), we use the

Field II ultrasound simulator [15] to generate the input samples for our experiments. The input to the Field II simulator is a point-model of a shell object. For our experiments we use a shell of 5,652 points. The exact simulator parameters for generating the input time-domain signals are shown in Table 2.

Table 2: Input parameters for the Field II simulator.
    Transmit
        Center Frequency         2.0 MHz
        Bandwidth                2.0 MHz
        Array Size
        Detector Size            0.35 mm
        Detector Spacing         0.4 mm
        Transmit Focal Depth     7mm
    Receive
        Array Size
        Detector Size            0.35 mm
        Detector Spacing         0.4 mm
        Receive Focal Depth      Infinite (1 22 m)
        Sampling Frequency       33 MHz
    Shell
        Inner Shell Radius       10 mm
        Outer Shell Radius       14 mm
        Shell Center             (5mm, -5mm, 7mm)
        Shell Thickness          4 mm
        Points Defining Shell    5,652
        Scatter Density          6.93 pts/mm^3

To investigate how each system parameter affects the execution time of the algorithm in practice, we examine the most important system parameters for beamforming-based ultrasound systems. Table 3 summarizes these parameters, their allowable value ranges, and the values used in our experiments. Each parameter is attributed either to the ultrasound system itself (physical) or to the beamforming algorithm (algorithmic). First, we verify that the parameters are (for all practical purposes) independent of each other by performing guided experiments (which we do not present here due to space limitations). Thus, we vary each parameter individually while keeping all other parameters constant. The base value and the range we use for each parameter are shown in Table 3. To make the effect of varying each parameter as visible as possible, we use as the base case values that result in relatively low amounts of computation.

We denote each configuration with the notation a{sensors}-b{beams}-f{FFT}-sp{focal}-bw{bandwidth}, where sensors is the number of sensors in each dimension of the planar array, beams is the number of beams in each tile of a snapshot, FFT is the FFT size, focal represents the focal zone size in millimeters, and bandwidth is the bandwidth of the filter in MHz.
For example, the base configuration, denoted as a32-b8-f512-sp10-bw2.0, specifies a configuration with a 32 × 32 sensor array, 8 × 8 beams per tile, an FFT size of 512, a depth of 10 mm for each focal zone, and a filter bandwidth of 2.0 MHz.

It is important to note that changing each parameter impacts not only execution time but image quality as well. Thus, it is important to be able to quantify image quality and to take it into account when evaluating various configurations. One traditional method of quantifying image quality is to correlate each generated image with the prototype that is being scanned, and to use the correlation number for ranking output images. In our case, however, since the input time series is generated with the Field II simulator, there is no actual input object or image. For this reason, we use as the prototype image the best possible image that the algorithm can generate (a32-b16-f4096-sp5-bw4.0). We correlate each pair of images by using the same coefficient that is used in statistics to express the degree of dependence between two variables [2]. In our case, each variable corresponds to the pixel values of one image. Although it is somewhat simplistic to compare two clinical images just by using the correlation coefficient (without expert opinion from medical personnel), we still get a good indication of the relative quality of various images. We perform various consistency checks to verify that the correlation coefficients correspond, to the extent possible, to the perceived quality of each image, and we feel reasonably confident that our ranking methodology is valid.

In our measurements we exclude the initialization time and we present measurements only for the parallel section. Moreover, as mentioned earlier, we assume that input data is delivered to memory by the data acquisition cards. Although these transfers may interfere with other communication in the system, this is not an issue since overall traffic is low (as we will see in Section 7).
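The image-quality ranking described earlier in this section can be sketched as follows. This is an illustration only: it assumes that the coefficient in question is the Pearson correlation coefficient and that images are plain 2D lists of pixel values of equal shape. The function names pearson and image_correlation are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant image")
    return cov / math.sqrt(var_x * var_y)


def image_correlation(image: Sequence[Sequence[float]],
                      prototype: Sequence[Sequence[float]]) -> float:
    """Correlate an output image with the prototype, pixel by pixel."""
    flat_image = [p for row in image for p in row]
    flat_proto = [p for row in prototype for p in row]
    return pearson(flat_image, flat_proto)
```

Under this assumption, an image that differs from the prototype only by a uniform gain or offset still scores 1, so the coefficient reflects the spatial structure of the image, not its absolute brightness.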
It is important to note that, although we do not evaluate this aspect of our system, one of the advantages of using a cluster to process the input data is that the I/O path bandwidth scales linearly as we increase the number of nodes in the cluster. Finally, in our measurements we exclude the time needed to send the processed data from each node to a separate node that displays the images. However, the amount of communication required is very small and occurs over a separate 1 MBit Ethernet network.

For the parallel section of the algorithm, we present both overall execution times and execution time breakdowns. To reveal which parts of the algorithm incur high overheads, we break execution time into the following components:
FFT is the total time spent performing FFTs on the input samples.
Filter is the time spent filtering out frequencies that are outside a pre-specified range.
Csteer is the time spent steering the samples corresponding to the column sensors.
Communication is the time spent redistributing the data among the system nodes. For the uniprocessor case, communication time represents the time to transpose the array locally.
Rsteer is the time spent steering the data that correspond to each sensor row.
IFFT is the time spent performing inverse FFTs.
Finally, Other represents the time spent in the rest of the parallel section of the algorithm.

7. RESULTS
In this section we first present our overall performance results and then we examine the effect of each individual parameter separately.

7.1 Overall Execution Time
Figure 4 shows the total execution time of the parallel section of each configuration as the number of processors changes. We see that execution time decreases linearly with the number of processors (the x-axis uses a log scale). This is mainly because the partitioning of the tasks is well balanced and the amount of communication between the two phases of the parallel algorithm is relatively small.
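The redistribution step between the two phases (the Communication component above) can be pictured as a distributed transpose. The sketch below simulates it in plain Python under assumptions the text does not state: the data is a 2D array partitioned into contiguous row blocks, one block per node, and each node sends one sub-block to every other node. The function names partition_rows and redistribute are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def partition_rows(matrix: Matrix, num_nodes: int) -> List[Matrix]:
    """Give each node a contiguous block of rows (assumed layout)."""
    rows = len(matrix)
    if rows % num_nodes != 0:
        raise ValueError("row count must be divisible by the node count")
    step = rows // num_nodes
    return [matrix[i * step:(i + 1) * step] for i in range(num_nodes)]


def redistribute(local_blocks: List[Matrix]) -> List[Matrix]:
    """Simulate the all-to-all exchange between the two phases.

    Afterwards node j owns a contiguous block of rows of the
    transposed array, built from one message per source node.
    """
    num_nodes = len(local_blocks)
    cols = len(local_blocks[0][0])
    if cols % num_nodes != 0:
        raise ValueError("column count must be divisible by the node count")
    step = cols // num_nodes
    result = []
    for dst in range(num_nodes):
        wanted = range(dst * step, (dst + 1) * step)
        # One message per source node: its rows, restricted to dst's columns.
        messages = [[[row[c] for c in wanted] for row in block]
                    for block in local_blocks]
        received = []
        for k in range(step):
            new_row = []
            for msg in messages:
                new_row.extend(r[k] for r in msg)
            received.append(new_row)
        result.append(received)
    return result
```

With a single node the exchange degenerates into a local transpose, which matches the description of the uniprocessor case. In this layout each node sends exactly one message to every other node, and individual messages shrink as the number of nodes grows.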
The message size depends on the number of processors, the center frequency of the ultrasound signal, and the bandwidth of the filter. In our experiments message sizes vary between 112 bytes and 127K bytes. On average, the total amount of data exchanged between the two phases for each frame is about 1 MByte. This imposes fairly small bandwidth requirements on the interconnect and is well within the capabilities of modern system area networks.

Next, we note that the processing rate for our base case, a32-b8-f512-sp10-bw1.0, is about 2 frames/s. The configuration with the least amount of processing, a8-b8-f512-sp10-bw2.0, can generate about 5.5 frames/s with acceptable quality. Although this is still less than what is needed for real-time performance (ideally, we would require a rate of 20-30 frames/s), using faster processors that are already available would offer 2-3 times better performance today and real-time performance within a few months. Preliminary runs on an 8-node cluster with 2.0 GHz Pentium 4 processors show that the speedup compared to our 800 MHz processors varies between 1.6 and 3.9, with an average of about 2.3 across all configurations. For our base configuration, a32-b8-f512-sp10-bw2.0, there is a speedup of about 2.1. The speedup on Csteer is about 4.0, whereas the speedup on FFT and IFFT is between 1.5 and 1.7. Finally, our fastest configuration, a8-b8-f512-sp10-bw2.0, exhibits a speedup of about 3.2 over our 800 MHz cluster and results in a final speed of about 9 frames/s. Given the linear speedups we observe on our 16-node cluster, we expect that using sixteen 2.0 GHz nodes will result in real-time performance for configurations with acceptable or even high image quality.

Table 3: Algorithm parameters, valid ranges for each parameter, and the values we examine. Values marked with * indicate the base value for each parameter.
Parameter                      Characteristics  Range                       Values used
Number of sensors              Physical         m × m (m = 32, 24, 16, 8)   32 × 32*, 24 × 24, 16 × 16, 8 × 8
FFT size (samples)             Algorithmic      # of samples                512*, 1024, 2048, 4096
Filter bandwidth (MHz)         Algorithmic      [0.5, 4.0]                  0.5, 1.0, 1.5, 2.0*, 2.6, 3.0, 4.0
Focal zone size (cm)           Algorithmic      1.0                         0.5, 1.0*
Number of beams per 10° × 10°  Algorithmic      n × n (n ≤ 16)              16 × 16, 8 × 8*, 4 × 4

Figure 4: Speedups for different parameter sets by the number of processors. (x-axis: number of processors; y-axis: execution time in seconds; one curve per configuration.)

7.2 Effects of system parameters
We now describe the effects of different parameters on execution time. Since it is important to also consider the effect on image quality, we also present the correlation rankings for each final (output) image.

Figure 5: Execution time breakdowns for different numbers of sensors. (x-axis: number of sensors, 32x32, 24x24, 16x16, 8x8, grouped by processor count P1, P2, P4, P8; y-axis: execution time in seconds; stacked components: Other, IFFT, Rsteer, Comm., Csteer, Filter, FFT.)

7.2.1 Number of Sensors
Figure 5 shows the execution time breakdown as we vary the number of sensors. All other parameters are set to their base values, b8-f512-sp10-bw2.0. We observe that the overall execution time is almost linear in the total number of sensors and in the number of processors. Next, we observe that the time spent in each section of the algorithm decreases linearly with the number of sensors, except for IFFT, which is independent of the number of sensors and remains constant. Figure 6 shows the correlation coefficient for each output image as the number of sensors is reduced (the number of processors does not affect image quality). We see that the image quality degrades significantly as the number of sensors drops; the best and the worst cases differ by more than 15%.
However, we should note that whether this reduction in image quality is acceptable depends on the specific application. For instance, if the objects to be scanned are fairly simple, then the drop in quality may be acceptable, whereas for objects that have more complex contours this may not be the case.

Figure 6: Image correlation coefficient with different parameter sets. The table shows the x-axis values for each curve. (y-axis: correlation coefficient; curves: No. of Sensors, No. of Beams, FFT Size, Focal Size, Bandwidth.)

7.2.2 Number of beams
Varying the number of beams in the algorithm affects only the steering operations, the amount of communication, and the inverse FFTs. Figure 7 shows that both row-based steering and IFFT times are reduced super-linearly. This is because, below a certain number of beams, the amount of information to be processed fits completely in the L2 cache. This suggests that larger L2 caches can help this class of applications, as can application knowledge that limits the number of necessary beams.

The correlation coefficients (Figure 6) show that image quality degrades only if the number of beams is reduced to less than 8. In our experiments, each frame covers an area with an angle of 10° × 10°. For the 16 × 16 beam frames, each beam covers an angle of 0.625°. To cover the same area, the 4 × 4 beam snapshot has an angle of 2.5° for each beam, which is a lot coarser than using 16 × 16 beams. Our results suggest that for objects of similar complexity to our input, image quality degrades significantly only if each beam scans more than 1.25° (the per-beam angle of the 8 × 8 case).

7.2.3 FFT size
Figure 8 shows the execution time breakdowns for different FFT sizes and numbers of processors. We notice that the overall time spent in FFTs decreases slightly with the FFT size. Although smaller FFTs result in larger numbers of FFTs, and for the sizes we consider the L2 cache is always effective, smaller FFTs tend to be more efficient. We also note that FFT time decreases sub-linearly with the number of processors. Finally, we note that the size of the FFTs significantly affects column and row steering, communication, and inverse FFT times, all of which decrease linearly with FFT size. Figure 6 shows that FFT size has practically no influence on image quality for the input we use.
The reason for this is that the time samples have a high gain even for small FFT sizes and the algorithm is able to reconstruct a sharp image of the input. We expect that this behavior will change when we use input objects with more complex contours in the actual system. However, these results indicate that understanding the application areas where the ultrasound system is used can help identify appropriate values for parameters such as the FFT size and optimize for system cost, performance, and image quality tradeoffs.

Figure 7: Execution time breakdowns for different numbers of beams. (x-axis: number of beams, grouped by processor count P1, P2, P4, P8, P16; y-axis: execution time in seconds; stacked components: Other, IFFT, Rsteer, Comm., Csteer, Filter, FFT.)

7.2.4 Focal zone size
Similarly to FFT size, the focal zone size affects only the second phase of the algorithm. Decreasing the focal zone size from 10 to 5 mm doubles the number of focal zones (from 16 to 32) that are required to cover the same depth of volume (16 cm) and increases the time required for the second phase of the algorithm linearly (Figure 9). Finally, image quality is not affected significantly by the focal zone size (Figure 6), for reasons similar to those explained for the effects of the FFT size.

7.2.5 Filter bandwidth
Changing the filter bandwidth affects all aspects of the algorithm except for the time spent in FFTs and IFFTs (Figure 10). The correlation coefficient for the output images (Figure 6) shows that image quality degrades only if the filter bandwidth drops below 1.0 MHz. This is due to the physical characteristics of our simulated transducer array. The acoustic signal we use has a bandwidth of 2.0 MHz, so after the Fourier transform of the input samples the useful information is contained in a 2.0 MHz band in the frequency domain. Using smaller filter bandwidths excludes some of this information, and bandwidths of less than 1.0 MHz result in significant degradation of image quality.

7.2.6 Summary
Overall, we find that our parallel implementation scales linearly with the number of processors and that we can achieve almost real-time performance with state-of-the-art clusters. Furthermore, we find that the amount of time spent in FFT and steering operations dominates. In all our experiments, the parallel implementation spends 85-98% of the time in FFT and beam steering functions, and most of the runs spend between 92% and 95% in these functions. Finally, the values of different parameters have a significant impact on the computational requirements of the algorithm. Thus, application knowledge that can help in selecting appropriate values for these parameters may be important in optimizing future ultrasound systems for cost and performance.

Figure 8: Execution time breakdowns for different FFT sizes. (x-axis: FFT size, grouped by processor count P1, P2, P4, P8, P16; y-axis: execution time in seconds; stacked components: Other, IFFT, Rsteer, Comm., Csteer, Filter, FFT.)

8. RELATED WORK
To the best of our knowledge, there is very little work on understanding the computational characteristics of ultrasound imaging beamforming algorithms on modern clusters. Numerous solutions for the acquisition problem and a large number of algorithms for processing the sensor time series have been proposed recently [25, 11, 2, 15, 18, 24]. This work has examined, among others, issues related to transducers and their relation to beamforming techniques for ultrasound systems. Our work is orthogonal to this and relies on high-quality transducer arrays. Also, previous work has examined the usage of beamforming algorithms in ultrasound and other medical applications [17, 7]. Finally, there has been a large body of work on parallel beamforming algorithms and implementations on both high-end parallel systems and distributed workstations [4, 5, 9, 16, 1]. However, all this work has examined applications from other domains, in particular sonar systems.

9. CONCLUSIONS
In this paper we examine a family of algorithms that are used in high-resolution 3D medical imaging systems. We present the necessary background, describe the fundamental algorithmic aspects, and study the computational behavior on modern architectures. Our work indicates that for many applications specialized architectures are not necessary and that generic clusters may be used. In particular, we see that our implementation of a state-of-the-art beamforming algorithm, by carefully decomposing the original problem, results in linear speedups in systems of up to 16 processors.
On a 16-processor system we can achieve almost-real-time medical imaging with acceptable or high image quality. Given that we use older-generation processors, we expect that today's systems (or systems available within a few months) will be able to provide real-time performance, resulting in significant flexibility and cost benefits compared to traditional custom solutions. Preliminary results with 2.0 GHz Pentium 4 processors show an average speedup of about 2.3 across configurations compared to our 800 MHz cluster. This ability to take advantage of the latest system components as they become available, with no additional cost for redesigning the system architecture, is one of the fundamental benefits of our approach. Thus, given our results, we expect that modern clusters will be used successfully in a wider range of medical applications.

Figure 9: Execution time breakdowns for different focal zone sizes.

We find that FFT and steering costs are the most significant overheads and that communication requirements are very low. In most of our experiments, the application spends between 92% and 95% of its time in these sections. Furthermore, we study and reveal how each section of the parallel implementation depends on system parameters. We find that most dependences are linear, with small super- or sub-linear effects. In terms of the induced communication, each processor exchanges a small number of messages with all other processors in the communication phase. We use correlation coefficients to quantify the impact on image quality, and we find that the effects of different parameters on image quality are very diverse, which indicates that application knowledge is important in optimizing future ultrasound systems for cost-performance. Furthermore, our work indicates that if specialized solutions are necessary, for instance for portable ultrasound systems, system designers can focus on optimizing certain sections of the algorithm and ignore the rest.
Finally, we expect that, given their advantages over more traditional solutions, modern clusters with low-latency and high-bandwidth networks will be capable of handling a wide range of medical applications and will be instrumental in improving the cost and precision of medical infrastructure.

10. ACKNOWLEDGMENTS
We are thankful to Andreas Moshovos for helping with various uniprocessor optimizations of the sequential algorithm. We thankfully acknowledge the support of Natu-
