Three-Dimensional Image Generation and Processing in Underwater Acoustic Vision


VITTORIO MURINO, MEMBER, IEEE, AND ANDREA TRUCCO, MEMBER, IEEE

Invited Paper

Underwater exploration is becoming more and more important for many applications involving physical, biological, geological, archaeological, and industrial issues. Unfortunately, only a small percentage of the potential resources under the sea has been exploited. The inherently unstructured environment and the difficulties implied by the nature of the propagating medium have placed limitations on the sensing and the understanding of the underwater world. Acoustic imaging systems are widely utilized for both large- and small-scale underwater investigations, as they can more easily achieve both short and long visibility ranges, though at the expense of coarse resolution and poor visual quality. This paper surveys the up-to-date advances in acoustic acquisition systems and data processing techniques, focusing especially on three-dimensional (3-D) short-range imaging for scene reconstruction and understanding. In fact, the advent of smarter and more efficient imaging systems has allowed the generation of good-quality, high-resolution images and the related design of proper techniques for underwater scene understanding. The term acoustic vision is introduced to describe, in general, all data processing (especially image processing) methods devoted to the interpretation of a scene. Since acoustics is also used for medical applications, a short overview of the related systems for biomedical acoustic image formation is provided. The final goal of the paper is to establish the state of the art of the techniques and algorithms for acoustic image generation and processing, providing technical details and results for the most promising techniques, and pointing out the potential capabilities of this technology for underwater scene understanding.

Keywords: Acoustic vision, array signal processing, biomedical acoustic imaging, image analysis, image processing, image reconstruction, image segmentation, object recognition, pattern recognition, underwater acoustic imaging, underwater technology.

I. INTRODUCTION

Acoustic imaging is an active research field devoted to the study of techniques aimed at the formation and processing of images generated from raw signals acquired by an acoustic system. These techniques are used in typical applications concerning underwater investigation and medical image analysis, and are less exploited for aerial (robotic) applications. In general, all these applications require that the scene under investigation be previously insonified by an acoustic signal, and that the backscattered echoes be acquired by the system (active sensing). In particular, in marine applications, the importance of underwater investigation is nowadays growing: subsea resources have been only slightly exploited, despite a great interest in underwater biological, geological, archaeological, and industrial applications in general.

Manuscript received September 7, 1999; revised July 31, 2000. V. Murino is with the Dipartimento Scientifico e Tecnologico, University of Verona, Verona, Italy (e-mail: murino@ieee.org). A. Trucco is with the Department of Biophysical and Electronic Engineering (DIBE), University of Genova, Genova, Italy (e-mail: trucco@ieee.org).
Typically, acoustic imaging systems are widely utilized as they allow large- and small-scale explorations, but data quality is not always good, due to noise and coarse resolution; hence, acoustic images are not easily interpreted by inexperienced human operators. However, the advent of smarter and more efficient imaging systems has allowed the generation of data of higher quality, thus making it possible to design proper techniques for underwater scene understanding. The general term acoustic vision is introduced here to describe all techniques and methodologies devoted to the generation and processing of acoustic images for the understanding (segmentation, reconstruction, recognition, etc.) of the observed scenario in either application field. More specifically, in underwater applications, acoustic vision can be defined as the set of algorithms and methods aimed at the localization and recognition of submerged objects (typically, man-made) from computed images, and hence at the reconstruction and interpretation of an underwater scene. Clearly, its range of applicability varies according to the specific sensor used or, rather, to the signal frequency characterizing the sensor. Generally, high frequencies (from about 100 kHz to a few megahertz) are utilized for a visibility range going from some centimeters to 100 m, leaving out all applications (at lower frequencies and wider ranges) specifically devoted to seafloor survey and mapping.

The purpose of this paper is to present a broad survey concerning the generation and processing of acoustic images for underwater applications, especially focusing on algorithms for three-dimensional (3-D) image generation and on image processing techniques for underwater scene interpretation. During the past few years, this field has grown considerably, due to the increasing need for better visualizing and, especially, understanding the underwater environment beyond the optical visibility range. Underwater optical vision provides images with finer resolution, but its range is more limited than that of acoustic sensors; moreover, typical optical devices do not provide 3-D information. However, special sensing configurations (e.g., range-gated imaging systems) [1] and laser-based approaches (e.g., LIDAR, structured illumination) [2] have been investigated that are able to enlarge the optical visibility range and to provide range data, too [3], [4]. These devices are still little investigated, but their potential use in the future is promising. On the other hand, acoustic sensors, besides providing a wider visibility range, although at coarse resolution, allow one to gain useful information about both two-dimensional (2-D) aspects of possible objects present in a scene and 3-D geometrical features. Unfortunately, the resulting image quality is quite poor, due to strongly degrading speckle noise, the nonideal characteristics of the sensor transfer function, and the wavelength of the signal used (much larger than that of light). In fact, although acoustic images are not significantly affected by the turbidity of the water (mainly caused by contamination and suspended sediments), deformations occur if the propagation of sound in water does not follow straight lines, owing to a sound velocity profile that is not constant. The velocity profile depends mainly on the distributions of temperature and salinity of the water, and deformations can be compensated for by measuring such parameters and using a ray-tracing approach to correct the acoustic images. However, deformations are appreciable only if the distance of the scene is quite large, so they are not considered in the short-range 3-D systems we deal with in this paper. From an application point of view, acoustic vision is an emerging research field with important potential benefits for underwater activities, like robotics, visual inspection, and manipulation of submerged man-made structures. To date, unlike 2-D imaging, the practical utilization of 3-D acoustic imaging is limited because of several scientific and technological issues that make it difficult and expensive to produce imaging systems working in real time. However, some recent achievements, in both the hardware and software areas, allow us to be confident in a fruitful evolution of the research, as confirmed by the development of some good prototypes (e.g., [5]) and commercial products (e.g., [6]). Finally, acoustic vision is a very important challenge also in applications different from the underwater one. For instance, in the medical field, the acquisition of 3-D acoustic images is one of the most ambitious objectives for the next generation of echographic devices. In this paper, some techniques and algorithms operating on underwater images will be described that can also prove valid in the medical imaging field. The first part of the paper will present concepts and evolutions related to various approaches to the generation of a 3-D image.
Like optical systems, acoustic systems can generate an image by processing the waves backscattered from the objects of a scene. The relative ease of measuring the time-of-flight of an acoustic signal makes it possible to generate not only acoustic 2-D images similar to optical ones, but also range estimates that can be used to produce a real 3-D map. Obviously, to start the process, the scene should be illuminated by the emission of an acoustic pulse. The backscattered echoes can then be processed to create an image of the scene. The whole process can be performed by two different approaches: the use of an acoustic lens followed by a retina of acoustic sensors, or the acquisition of echoes by a 2-D array of sensors and their subsequent processing by adequate algorithms, thus avoiding the need for a physical lens. Such algorithms belong to the beamforming or the holography class. In this paper, imaging systems using an acoustic lens (recently greatly improved), the focused beamforming algorithm, and acoustic holography are presented. To this end, a model of the interaction of the acoustic energy with the scene to be imaged has been developed in such a way that the aforesaid three imaging approaches can be analyzed and compared within a common mathematical framework. Special attention is given to spatial directivity, sidelobe rejection, depth of field, undersampling effects (with respect to the bandwidth), computational and approximation issues, and image representation. Unfortunately, a 2-D array of sensors is mandatory to obtain a 3-D image; consequently, the number of signals to be acquired and processed is often larger than a few thousand. This problem is strictly connected with the computational load of an algorithm and is shared by the three different imaging approaches. Other problems are shared as well: the blurring effect due to poor resolution and sidelobes, the speckle noise due to the coherent nature of acoustic imaging, and the specular behavior of man-made objects due to their insufficient roughness with respect to the acoustic wavelength. Some potential solutions proposed over the past years are reported and discussed, although an optimum and definitive 3-D imaging system cannot be devised yet. Also, a panorama of the existing 3-D acoustic systems and of the encouraging prototypes is provided, and real examples of obtained 3-D images are given. The second part of the paper presents image processing methods useful to extract information from the data set so as to recognize the observed scene. An early processing stage, even though simple (e.g., thresholding), is always included in every acoustic imaging system in order to perform basic operations, like detection of the objects (actually, external surfaces) inside an observed scene, and to improve visual understanding by reducing interfering clutter. These methods are quite common and typically used to improve image quality quickly, thus allowing the operator to extract meaningful information from the large amount of data acquired by a sensor. Actually, such methods differ according to the various cases considered, and are often adapted to the specific sensor used and to the kind of image representation. More

complex methods can be applied, like filtering and image restoration. Mask filters, like median-like or low-pass-like ones, can be applied to reduce noise effects, and restoration techniques, computationally more costly but more accurate, aim to enhance image quality by modeling the influence of speckle and sidelobes. Geometric corrections are also performed, especially on 2-D data, when the movement of the sensor causes serious cluttering phenomena affecting data quality (long-range, low-frequency sonars). After the above-described data processing phase, segmentation and/or reconstruction may be carried out to identify the most significant regions. Recognition or classification methods can subsequently be applied to actually identify the objects of interest in an image and to visualize them for easier human understanding. A typical image processing scheme is sketched in Fig. 1, where the proposed hierarchical framework can be interpreted, at a coarse resolution, as being subdivided into low-, middle-, and high-level processing, as in classic vision approaches [7] (see also Table 1 for a concise description of the several phases and the related techniques).

Fig. 1. Hierarchical processing scheme for acoustic image processing and understanding.

Several approaches have been proposed, but standard techniques cannot be identified. However, due to the complex nature of the problem and the heavy presence of noise, statistical techniques are typically used for segmentation, thanks to their capability to model the acoustic image formation process, whereas geometric methods are utilized for reconstruction, thus allowing the subsequent use of pattern-recognition techniques. Several methodologies, derived from feature-based classification, computer vision, and artificial intelligence (AI) approaches, are then proposed for higher level tasks, mainly associated with underwater-vehicle issues, like navigation and environment modeling. In this paper, we analyze the several approaches to image filtering, segmentation, reconstruction, and object recognition presented in the literature. Major theoretical and procedural techniques will be considered, also providing significant results whenever necessary. The target of this paper is to define the up-to-date scientific aspects related to the many acoustic imaging problems that arise from the increasing need for high-quality underwater vision, also and especially over short ranges for robotics applications (e.g., offshore structure inspection, underwater vehicle navigation, pipeline laying, archaeological research, etc.). In this context, the present work can be seen as resulting from the evolution of two previous surveys, made in 1979, which were partially related to this subject [8], [9]. Therefore, the evolution of acoustic-image formation and processing methods developed during the last 20 years is considered. Given the novelty of the field and the lack of recent comprehensive papers on this topic, the investigation, discussion, and comparison of the above-mentioned methods should be regarded as important features of the paper. The rest of the paper is organized as follows. In Section II, a mathematical framework for the modeling of general 3-D acoustic image generation is proposed. Then, specific methods, namely, beamforming, holographic, and lens-based techniques, are detailed in Section III. Problems shared by all such methods and some available solutions are discussed in Section IV. Real systems and their features are presented in Section V.
Then, 3-D acoustic imaging for medical applications is outlined in Section VI. In Section VII, the formation and representation of an image from raw signals are described, and early (low-level) data processing for image enhancement is also discussed. Section VIII deals with higher level tasks, like segmentation, reconstruction, and recognition, each aimed at understanding the content of an acoustic image: statistical, geometric, pattern-recognition, and other approaches to acoustic image interpretation are addressed. Finally, conclusions are drawn in Section IX.

II. A DATA MODEL FOR 3-D IMAGING

To introduce a data model useful to analyze and compare different image-generation approaches, it is important to recall that, in underwater 3-D imaging, a typical scene is composed of several solid and continuous objects. One can define underwater acoustic vision as the set of techniques aimed at estimating the shapes and determining the exact positions of the external surfaces of the objects in a scene, disregarding their internal structures; more generally, as the set of techniques aimed at the understanding of an acoustic image. A propagating acoustic wave is reflected if and only if it encounters a change in the acoustic impedance. The acoustic impedance is the proportionality factor between the velocity of the medium particles and the sound pressure; it depends on the density and the compressibility of the medium [10], [11].

Table 1. Summary of the data processing phases, with indication of the main objectives and the typical applied methodologies.

Typical underwater solid objects (man-made or of natural origin) have an acoustic impedance that is very different from that of the fluid in which they are immersed. As a consequence, only a very small portion of the acoustic energy impinging on an object is transmitted inside it, whereas the remaining part is reflected and scattered outside [10], [11]. This fact differentiates underwater imaging from medical imaging, where the small discrepancies among the acoustic impedances of the different layers making up the human body allow one to image both the external boundary and the internal structure of a body organ. One of the most powerful methods for computing the field returned by a complex and realistic underwater object (large with respect to the wavelength) is to represent its surface by a collection of densely packed point scatterers or small facets [12]-[14]. These approaches are, to some extent, related to or physically motivated by the Helmholtz-Kirchhoff integral (i.e., the basis for many theoretical developments associated with scattering), as discussed in [15], [16]. Here we are not interested in analyzing the motivations, advantages, and drawbacks of such methods from a simulation point of view. Our aim is simply to develop a data model that allows us to compare different image-generation approaches and to point out crucial problems. This goal can be achieved by adopting the method based on a collection of point scatterers: from an ideal point of view, the acoustic characteristics of the surface of a man-made or natural object can be reproduced by acting on the positions and dimensions of the scatterers. We can assume that the imaged scene is composed of N point scatterers; the s-th scatterer is placed at the position \vec r_s, and its distance from the coordinate origin is equal to r_s = |\vec r_s|, as shown in Fig. 2(a). If an acoustic pulse is emitted by an ideal point source placed in the coordinate origin, and if one assumes that spherical propagation occurs inside an isotropic, linear, absorbing medium, then the Fourier transform of the pressure field incident on the s-th scatterer, computed by the free-space Green's function, is proportional to e_s(f), defined as follows:

e_s(f) = E(f) \frac{e^{-\alpha(f) r_s}}{r_s} e^{-j(\omega/c) r_s}    (1)

where
E(f) is the Fourier transform of the emitted pulse e(t);
\alpha(f) is the absorption coefficient of the water;
\omega = 2\pi f is the angular frequency;
f is the frequency;
c is the sound velocity in the medium.

Fig. 2. Notation and geometry of the data model. (a) 3-D representation; the receiving aperture is on the plane z = 0. (b) 2-D projection on the plane y = 0.

A point scatterer can be defined as a simple particle whose dimensions are very small in comparison with the wavelength and that follows Rayleigh's scattering process [10], [13]. If we define the plane z = 0 as the plane that receives the backscattered field [see Fig. 2(a)], the Fourier transform of the pressure measured at the position (x, y, 0) and due only to the action of the s-th scatterer is given by (2), whose terms include:
a, the radius of the scattering particle;
\rho and \rho_s, the densities of the propagating medium and the scatterer;
\kappa and \kappa_s, the compressibilities of the propagating medium and the scatterer;
the angle between \vec r_s and \vec d_s (the directions of the incident and scattered waves).

To simplify the above equation, it is possible to apply two approximations [8], affecting only the amplitude of the received field and not its phase: 1) if the receiving aperture has a limited extension and is centered in the coordinate origin, as shown in Fig. 2(a), then this angle is small and its cosine is close to one; 2) for the same reason, the scatterer-to-sensor distance d_s is close to r_s, so the amplitude factor 1/(r_s d_s) can be replaced by 1/r_s^2. By using these two approximations and adding the effects of the point scatterers, one can obtain the total pressure field received at the position (x, y, 0):

P(x, y, f) = \sum_{s=1}^{N} R_s E(f) \frac{\omega^2 e^{-\alpha(f)(r_s + d_s)}}{r_s d_s} e^{-j(\omega/c)(r_s + d_s)}    (3)

where R_s is a constant collecting the scattering terms of (2) for the s-th scatterer, as defined in (4). In the above equations, independent scattering [13], [17] has been assumed, i.e., the sound scattered by a particle does not impinge on other particles; hence, multiple scattering does not occur.

An equation very similar to (3) is obtained also in the medical and nondestructive evaluation (NDE) cases [18], [19], where the discrete summation over the scene scatterers is replaced by an integral over an investigation volume in which continuous variations in reflectivity (i.e., a function of the density and compressibility similar to R_s) should be measured, and the absorption of the medium is neglected. The presence of the factor \omega^2 in (3) is a common characteristic of scattering theory. Nevertheless, Henderson and Lacker [13] observed that this frequency coloration is not consistent with the scattering phenomena typically encountered in underwater acoustics. Other authors [12], [14], [20]-[23], addressing underwater, medical, NDE, and aerial applications, and also microwave imaging, do not insert this coloration factor into their data models; at the same time, however, none of them considers the sound absorption of the medium, which depends on the frequency. Here we would point out that, for the typical frequencies and distances involved in 3-D underwater imaging, the factor \omega^2 is compensated for by water absorption. This can be verified by looking at Fig. 3, which shows the profile of the function L(f) = (2\pi f)^2 \exp\{-2\alpha(f) r\} for r equal to 10, 15, 20, and 25 m, where the equation for the absorption coefficient has been taken from [24]. The function L(f) has been defined so as to be similar to the frequency-dependent part of the ratio present in (3). One can notice that the value of L(f) exhibits a moderate increase over the frequency range [400 kHz, 1.6 MHz] for r = 10 m, whereas it is nearly constant for r = 15, 20, and 25 m. For larger distances, one can verify that L(f) shows a moderately decreasing profile. Considering that real signals have a bandwidth narrower than that used in Fig. 3, one can introduce in (3) a constant factor C_{r_s} (dependent on the distance r_s) in place of the frequency-dependent term, thus avoiding the parabolic effect of \omega^2.
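As a numerical check on this compensation argument, the following sketch evaluates L(f) over the band of Fig. 3. The quadratic-in-frequency absorption law, scaled to a typical seawater value of about 0.3 dB/m at 1 MHz, is an assumption standing in for the formula of [24]:

```python
import numpy as np

# L(f) = (2*pi*f)^2 * exp(-2*alpha(f)*r), as in Fig. 3, with an assumed
# absorption law alpha(f) = a*f^2 (Np/m) scaled so that alpha(1 MHz) is
# about 0.035 Np/m (~0.3 dB/m, a typical seawater figure).

A_COEFF = 0.035 / (1.0e6 ** 2)  # Np/m/Hz^2 (assumed)

def alpha(f_hz):
    """Assumed absorption coefficient in Np/m."""
    return A_COEFF * f_hz ** 2

def L(f_hz, r_m):
    return (2 * np.pi * f_hz) ** 2 * np.exp(-2 * alpha(f_hz) * r_m)

f = np.linspace(400e3, 1.6e6, 200)
for r in (10.0, 15.0, 20.0, 25.0):
    Lf = L(f, r)
    # flatness of L(f) over the band: near 1 means good compensation
    print(f"r = {r:4.1f} m: max/min of L(f) over [0.4, 1.6] MHz = {Lf.max() / Lf.min():.2f}")
```

Under this assumed law, L(f) varies by roughly a factor of 3 at r = 10 m and stays within about 30% for r = 15 m and beyond, consistent with the qualitative behavior described above.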
Finally, one can write the received field as

P(x, y, f) = \sum_{s=1}^{N} R_s C_{r_s} E(f) e^{-j(\omega/c)(r_s + d_s)}    (5)

As the delay of an echo is proportional to the distance of the scatterer that generated it, in typical sonar systems the receiving channels include an amplifier with a time-varying gain (TVG) to recover the attenuation due mainly to the distance [24]. If we assume that the TVG is applied to the received field, as in (6), then the term C_{r_s} R_s is strictly related to the scatterer reflectivity and sufficiently independent of the scatterer distance.
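A minimal sketch of such a TVG curve follows, assuming two-way spherical spreading (so a 1/r^2 amplitude loss) plus round-trip absorption; alpha() is the assumed absorption law from the previous snippet, and all names are illustrative:

```python
import numpy as np

C_WATER = 1500.0  # sound speed in water, m/s

def tvg_gain(t_s, f_carrier_hz):
    """Amplitude gain applied at echo time t (two-way travel time)."""
    r = C_WATER * t_s / 2.0                              # range of the echoing cell
    spreading = r ** 2                                   # undo 1/(r_s d_s) ~ 1/r^2
    absorption = np.exp(2.0 * alpha(f_carrier_hz) * r)   # undo round-trip absorption
    return spreading * absorption

# Example: gain values across a 5..40 m range swath at a 500 kHz carrier.
t = 2.0 * np.linspace(5.0, 40.0, 8) / C_WATER
print(np.round(tvg_gain(t, 500e3), 1))
```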

Fig. 3. Value of the function L(f) = (2\pi f)^2 \exp\{-2\alpha(f) r\} versus frequency, for different distance values.

When the cone that has its vertex in the coordinate origin and contains the scene volume to be imaged has an angular aperture that is not too wide, one can use the Fresnel approximation for the Green's function [25], [26] to write the term in the phase of (5) as follows:

r_s + d_s \approx 2 r_s - \hat r_s \cdot \vec p + |\vec p|^2/(2 r_s)    (7)

where \vec p = (x, y, 0) is the receiving position, r_s is the modulus of \vec r_s, and \hat r_s is a unitary vector equal to \vec r_s / r_s. Although the Fresnel approximation has often been applied to develop imaging theory [20], [8], [22], it is important to recall that it has a well-defined validity region [26]. In particular, to introduce only negligible errors, the maximum angle between the vectors \vec r_s and the z-axis must not exceed 18°. Moreover, the distance of the scene should be included in the interval [0.68 (D^3/\lambda)^{1/2}, 2D^2/\lambda], D being the diameter of the receiving aperture and \lambda the wavelength. Under these conditions, the received field can be rewritten as in (8). When the scene distance is larger than 2D^2/\lambda, the far-field condition is fulfilled, the quadratic phase term can be neglected, and the Fraunhofer approximation [25], [26] can be used, resulting in the further simplification (9). However, we shall not consider this approximation to model received signals in 3-D underwater imaging (nor in medical and NDE applications), as the scene to be imaged is very often in the near field, i.e., at a distance smaller than 2D^2/\lambda. If, for simplicity, we restrict the reasoning to the plane y = 0 [see Fig. 2(b)], then \vec p = (x, 0, 0) and \hat r_s \cdot \vec p = x \sin\theta_s, \theta_s being the angle between the vector \vec r_s and the z-axis, and (8) can be simplified as follows:

P(x, f) = \sum_{s=1}^{N} R_s C_{r_s} E(f) e^{-j(\omega/c)(2 r_s - x \sin\theta_s + x^2/(2 r_s))}    (10)

The angle \theta_s is referred to as the arrival angle, as it indicates the direction of the echo of the s-th scatterer. The approximations discussed for the term r_s + d_s are of practical importance, as dealing with all combinations of r_s and d_s is sometimes very difficult (even in modern imaging systems exploiting digital technology), as discussed in Section IV.
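The validity-region limits quoted above translate directly into a small helper; the following sketch checks a scene point against the 18° angular limit and the [0.68 (D^3/\lambda)^{1/2}, 2D^2/\lambda] range interval:

```python
import numpy as np

def fresnel_region(aperture_d_m, wavelength_m):
    """Range interval within which the Fresnel approximation is reliable."""
    r_min = 0.68 * np.sqrt(aperture_d_m ** 3 / wavelength_m)
    r_max = 2.0 * aperture_d_m ** 2 / wavelength_m   # far-field boundary
    return r_min, r_max

def in_fresnel_region(r_m, theta_deg, aperture_d_m, wavelength_m):
    r_min, r_max = fresnel_region(aperture_d_m, wavelength_m)
    return (abs(theta_deg) <= 18.0) and (r_min <= r_m <= r_max)

# Example: the 9.45 cm aperture used later in the paper, at 500 kHz
# (lambda = 3 mm in water): the region is roughly [0.36 m, 5.95 m].
print(fresnel_region(0.0945, 0.003))
print(in_fresnel_region(2.0, 10.0, 0.0945, 0.003))
```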

III. IMAGE GENERATION CONCEPTS

Like optical systems, acoustic systems can generate an image by processing the waves backscattered by the objects of a scene. In 3-D underwater imaging, a scene is typically illuminated by the emission of an acoustic pulse, and the backscattered echoes are collected over a 2-D aperture and processed to create an image of the scene. A planar (i.e., 2-D) aperture is a minimal requirement, as a linear aperture is not sufficient to discriminate signals coming from a 3-D space. The operation of echo processing can be performed by two different approaches: the use of acoustic lenses followed by a retina of acoustic sensors, or the acquisition of the echoes impinging on a planar array of sensors and the processing of such echoes by adequate algorithms (thus avoiding the need for physical lenses). Among the processing algorithms, both beamforming and holographic methods can be successfully exploited in 3-D imaging systems. Acoustic lenses work like optical ones: backscattered echoes are focused on an image plane, where a 2-D retina of sensors transforms the acoustic image into electrical signals. Thanks to the ease of measuring the time-of-flight of an acoustic pulse, one can generate not only 2-D images (similar to conventional optical pictures) but also range estimates that can be utilized to produce a real 3-D map. Each sensor of the retina placed behind a lens receives a signal that represents the scene response coming from a well-defined direction. By collecting the signals of all the retina sensors, one can obtain complete information about the 3-D structure of the scene. Beamforming systems collect backscattered echoes by a 2-D array of sensors only once; then, they combine the echoes in such a way as to amplify the signal coming from a fixed direction (the steering direction) and to attenuate the signals coming from all other directions. As the output signal gives information about the scene structure in the steering direction, it is possible to create a 3-D image by repeating the beamforming process after fixing many adjacent steering directions, as in a raster-scan operation. Holographic systems also start from the echoes acquired by a 2-D array of sensors, but they aim to reconstruct the 3-D structure of a scene by back-propagating the received signals. Acoustic holography is a special case of inverse diffraction and is performed through the inversion of the propagation and scattering equations. An image is not generated by a raster-scan operation: the holographic algorithm produces the whole image at the same time. Obviously, the inversion of the aforesaid equations is not a trivial task, so many approximations have been proposed. Distinguishing between beamforming and holography is not always easy: despite the differences in the principles of the two approaches, their practical approximations and implementations sometimes result in identical algorithms. In addition, the use of the term holography is not entirely correct in this context, as we use it to indicate imaging algorithms that start from a record of the backscattered acoustic field containing both the amplitude and the phase. Originally, instead, the term holography was adopted to indicate a procedure mainly devoted to recording both the amplitude and the phase of a propagating wave field [25]. However, this terminology has such deep roots in the acoustic imaging community [27] that we shall use it, even though it is not completely well suited. In the rest of this section, starting from beamforming, we present mathematical descriptions and comparisons of the above-mentioned three techniques, trying to define the related advantages and drawbacks.
Moreover, a section is devoted to clarifying the transition from a collection of output signals to a 3-D image.

A. The Beamforming Approach

In essence, beamforming is a spatial filter that linearly combines the temporal signals spatially sampled by a discrete antenna, i.e., an array of sensors placed according to a known geometry. There are two major categories of beamforming algorithms [28]: data-independent algorithms (also called conventional beamforming) and data-dependent ones (referred to as adaptive and partially adaptive beamforming). In imaging applications, data-independent beamforming is used and successfully implemented, although some efforts to exploit data-dependent beamforming too are in progress and will be described in a later section. Let us consider a set of M point-like and omnidirectional sensors that constitute a receiving 2-D array, numbered by the index m, from 0 to M-1. Denoting by \vec p_m the position of a given sensor of the set on the plane z = 0 and by s_m(t) the signal received by that sensor, linearly proportional to the pressure field, one can compute the beam signal b(t; \hat u, r_F), steered in the direction of the unitary vector \hat u, by using the following definition:

b(t; \hat u, r_F) = \sum_{m=0}^{M-1} w_m s_m(t + \tau_m)    (11)

\tau_m = (|r_F \hat u - \vec p_m| - r_F)/c    (12)

where w_m are the weights assigned to each sensor and r_F is the focusing distance. The net result is the formation of a temporal signal in which the contributions coming from the direction \hat u and the distance r_F are amplified, whereas those coming from other directions and distances are attenuated. To show the outcome of the beamforming clearly, as at the end of the previous section, we restrict the reasoning to the plane y = 0 and assume all the weights to be unitary. The effects of real 3-D operations and nonunitary weights will be discussed later on. In the plane y = 0, the position vector \vec p_m can be replaced with the abscissa x_m, and the steering direction can be indicated by the angle \theta_0 measured with respect to the z-axis [see Fig. 2(b)]. The beamforming definition in (11) is reduced to

b(t; \theta_0, r_F) = \sum_{m=0}^{M-1} s_m(t + \tau_m)    (13)

\tau_m = (\sqrt{r_F^2 + x_m^2 - 2 r_F x_m \sin\theta_0} - r_F)/c    (14)

The Fourier transform of (13) is equal to

B(f; \theta_0, r_F) = \sum_{m=0}^{M-1} S_m(f) e^{j\omega\tau_m}    (15)
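The definition in (11)-(12) maps directly onto a time-domain implementation. The sketch below applies exact (non-Fresnel) focusing delays and handles fractional delays by linear interpolation; the variable names and the random test record are illustrative:

```python
import numpy as np

C = 1500.0  # sound speed, m/s

def beamform(signals, positions, fs, steer_u, r_f, weights=None):
    """Beam signal steered along unit vector steer_u, focused at r_f.

    signals: (M, T) array sampled at fs; positions: (M, 3) sensor
    coordinates p_m on the z = 0 plane.
    """
    m_sensors, n_samples = signals.shape
    if weights is None:
        weights = np.ones(m_sensors)
    focus = r_f * np.asarray(steer_u)                 # focal point r_F * u_hat
    t = np.arange(n_samples) / fs
    beam = np.zeros(n_samples)
    for m in range(m_sensors):
        tau = (np.linalg.norm(focus - positions[m]) - r_f) / C   # eq. (12)
        # s_m(t + tau), resampled onto the common time axis
        beam += weights[m] * np.interp(t + tau, t, signals[m], left=0, right=0)
    return beam

# Example: 64-element line array, 1.5 mm pitch, steered 30 deg off axis.
m = 64
pos = np.zeros((m, 3))
pos[:, 0] = (np.arange(m) - (m - 1) / 2) * 1.5e-3
u = np.array([np.sin(np.radians(30.0)), 0.0, np.cos(np.radians(30.0))])
rng = np.random.default_rng(0)
b = beamform(rng.standard_normal((m, 4096)), pos, fs=5e6, steer_u=u, r_f=5.0)
```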

Fig. 4. Beam power patterns of a 64-element underwater array with 1.5 mm spacing and unitary weight coefficients. (a) Frequency = 300 kHz, steering = 30°. (b) Frequency = 300 kHz, steering = 0°. (c) Frequency = 500 kHz, steering = 30°. (d) Frequency = 500 kHz, steering = 0°.

If we apply the Fresnel approximation to (14), we can rewrite the delay [29] as follows:

\tau_m \approx (x_m^2/(2 r_F) - x_m \sin\theta_0)/c    (16)

As s_m(t) corresponds to the field given by (10) and computed at x = x_m, we can substitute (10) and (16) into (15), thus obtaining the following:

B(f; \theta_0, r_F) = \sum_{s=1}^{N} R_s C_{r_s} E(f) e^{-j 2\omega r_s/c} \sum_{m=0}^{M-1} e^{j(\omega/c)[x_m(\sin\theta_s - \sin\theta_0) + (x_m^2/2)(1/r_F - 1/r_s)]}    (17)

If, for a while, we make the assumption that the distance of all the scatterers making up the scene is equal to the focusing distance (i.e., r_s = r_F: the scatterers are placed on the surface of a sphere centered in the coordinate origin), we can rewrite (17) as follows:

B(f; \theta_0) = \sum_{s=1}^{N} R_s C_{r_s} E(f) e^{-j 2\omega r_s/c} BP(\theta_s, \theta_0, \omega)    (18)

BP(\theta, \theta_0, \omega) = \sum_{m=0}^{M-1} e^{j(\omega/c) x_m (\sin\theta - \sin\theta_0)}    (19)

where BP(\theta, \theta_0, \omega) is a reception diagram, commonly called beam pattern, which depends on the arrival angle \theta, the steering angle \theta_0, and the angular frequency \omega. The effect of setting a focusing distance perfectly equal to the scene distance is the possibility of removing the quadratic term in (7); this means that, for objects placed at such a distance, the imaging system performs like a system working in the far field, for which the Fraunhofer approximation in (9) holds.

If the array is equispaced and centered in the coordinate origin, and if d is the interelement spacing, then the beam pattern of the beamforming process [29] can be written in the following closed form (normalized to a unitary main-lobe peak):

BP(\theta, \theta_0, \omega) = \frac{\sin[M(\omega d/2c)(\sin\theta - \sin\theta_0)]}{M \sin[(\omega d/2c)(\sin\theta - \sin\theta_0)]}    (20)

With reference to an array composed of 64 elements spaced 1.5 mm apart, Fig. 4 shows some beam patterns as functions of the arrival angle (visualized on a logarithmic scale normalized to 0 dB, provided that the absolute values are considered) for different frequency values and steering angles. As a last hypothesis, we can assume the beam pattern to have a constant profile over the signal bandwidth [20], [22], i.e., BP(\theta, \theta_0, \omega) \approx BP(\theta, \theta_0, \bar\omega), \bar\omega being the center angular frequency of the signal band. By this approximation and by applying the inverse Fourier transform, one obtains the following expression for the beam signal in the time domain:

b(t; \theta_0) = \sum_{s=1}^{N} R_s C_{r_s} BP(\theta_s, \theta_0, \bar\omega) e(t - 2 r_s/c)    (21)

This result means that each scatterer contributes to the beam signal by adding a replica of the acoustic pulse, delayed on the basis of its distance, weighted by its constant C_{r_s} R_s (strictly related to the scatterer reflectivity) and by the beam pattern value, which depends on the discrepancy between the arrival angle of the scatterer and the steering angle. This fact shows analogies with the Huygens-Fresnel principle [25] and the directivity pattern of secondary sources. Owing to the profile of the beam pattern, the contributions of the scatterers characterized by arrival angles very close to the steering angle have a predominant magnitude. Therefore, the beam signal is essentially the sum of the pulse replicas due to the scatterers making up a small area, i.e., the beam footprint, around the point at which the steering direction meets an object's surface. Owing to the surface continuity, the distances of such scatterers are very close and the replicas overlap in time. In conclusion, a clear replica of the insonification pulse will be present on the beam signal at a time instant (called the time-of-flight, TOF) corresponding to the distance of the scene in the steering direction; moreover, the amplitude of the pulse replica will be proportional to the reflectivity of the scene surface. On the basis of these considerations, the way of moving from a collection of beam signals to a 3-D image will be described in Section III-B. The beam pattern of an imaging system is very important, as it allows (like the point spread function in other cases) an objective evaluation of the system performance. A conventional beam pattern presents a main lobe in the direction in which the array is steered and sidelobes of minor, yet not negligible, magnitudes in other directions. The width of the main lobe is the measure of the angular resolution (also called lateral resolution) of the imaging system, whereas the generation of artifacts degrading useful information depends on the level of the sidelobes. The sidelobe level can be reduced by suitably fixing the values of the weight coefficients w_m in (11), i.e., by applying a windowing function [30]-[32], [29], but, unfortunately, this operation increases the width of the main lobe, thus worsening the angular resolution of the system. To verify this effect, one can compare the beam power pattern shown in Fig. 4(a) with that presented in Fig. 5(a), where the Dolph-Chebyshev window was applied to make the sidelobe level equal to -40 dB.
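The narrow-band beam pattern and the windowing tradeoff can be reproduced with a few lines of code. The sketch below evaluates the array factor of (19) for arbitrary weights, here rectangular versus Dolph-Chebyshev (via scipy's chebwin), for the 64-element, 1.5 mm array of Fig. 4:

```python
import numpy as np
from scipy.signal.windows import chebwin

C = 1500.0  # sound speed, m/s

def beam_pattern(theta_deg, steer_deg, f_hz, m=64, d=1.5e-3, weights=None):
    """Narrow-band beam pattern in dB, per eqs. (19)-(20)."""
    if weights is None:
        weights = np.ones(m)
    x = (np.arange(m) - (m - 1) / 2) * d            # element abscissas
    psi = (2 * np.pi * f_hz / C) * (np.sin(np.radians(theta_deg))
                                    - np.sin(np.radians(steer_deg)))
    # sum_m w_m * exp(j * psi * x_m), normalized to 0 dB at the main lobe
    bp = np.exp(1j * np.outer(psi, x)) @ weights
    return 20 * np.log10(np.abs(bp) / weights.sum())

theta = np.linspace(-90, 90, 1801)
bp_rect = beam_pattern(theta, steer_deg=30, f_hz=300e3)
bp_cheb = beam_pattern(theta, steer_deg=30, f_hz=300e3,
                       weights=chebwin(64, at=40))   # -40 dB sidelobes
print(bp_rect.max(), bp_cheb.max())  # both 0 dB at theta = 30 deg
```

Plotting bp_cheb against bp_rect shows the effect discussed above: the sidelobes drop to -40 dB while the main lobe broadens.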
When the weight coefficients are unitary, the following equation [29] provides an estimate of the arrival angles at which the main lobe is reduced by 3 dB, thus giving a measure of the main lobe's width:

\theta_{\pm 3 dB} = \arcsin(\sin\theta_0 \pm 0.443\,\lambda/(M d))    (22)

To improve the angular resolution, one can increase the number of elements, the frequency, or the interelement spacing. A comparison of Fig. 4(a) with 4(c) clearly shows the resolution improvement resulting from an increase in frequency from 300 to 500 kHz. However, if the frequency is further increased so as to reach 750 kHz, the obtained beam pattern is affected by a grating lobe at about 55°, as shown in Fig. 5(b). This aliasing effect is due to the spatial undersampling that occurs when the array elements are equispaced and the interelement spacing is larger than \lambda/2. Therefore, to obtain a good angular resolution and to avoid spatial aliasing, a large array made up of a dense grid of elements seems necessary. To reduce the number of sensors, arrays with a spacing larger than \lambda/2 are often designed. To avoid ambiguity effects due to the presence of grating lobes in the beam pattern, it is necessary to limit both the insonification and steering operations inside a narrower angular sector [20]. In greater detail, the maximum steering angle that avoids ambiguities is the following:

\theta_{max} = \arcsin(\lambda/(2d))    (23)

In the real world, the distances of the scene scatterers, r_s, are not equal to the focusing distance r_F, and the quadratic terms in (17) do not cancel each other; consequently, the beam pattern depends also on the arrival and focusing distances (r_s and r_F, respectively) as follows:

BP(\theta, \theta_0, \omega, r_s, r_F) = \sum_{m=0}^{M-1} e^{j(\omega/c)[x_m(\sin\theta - \sin\theta_0) + (x_m^2/2)(1/r_F - 1/r_s)]}    (24)

In particular, the larger the difference between r_s and r_F, the sharper the deformation of the main lobe in terms of height reduction and width expansion. These two effects can be observed by comparing Fig. 4(c) with Fig. 5(c): the first picture is obtained when r_s = r_F, whereas the second is obtained under the same conditions but when r_F = 1 m and r_s = 0.5 m (for the clarity of the pictures, a small array is under consideration, for which the boundary between the far field and the near field is placed 2.3 m from the origin). Generally, the depth of field is defined as the range interval around the focusing distance inside which the amplitude reduction of the main lobe does not exceed 3 dB [8], [33]. It is important to note that a large depth of field allows all the objects placed inside such a range interval to be imaged with only a slight loss in angular resolution, as the broadening of the main lobe width is moderate.
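Equations (22) and (23) give quick design numbers; the following sketch computes the -3 dB beamwidth and the maximum unambiguous steering angle for the example array:

```python
import numpy as np

C = 1500.0

def beamwidth_3db_deg(f_hz, m, d_m, steer_deg=0.0):
    """Main-lobe -3 dB width of a uniform M-element array, eq. (22)."""
    lam = C / f_hz
    s0 = np.sin(np.radians(steer_deg))
    lo, hi = s0 - 0.443 * lam / (m * d_m), s0 + 0.443 * lam / (m * d_m)
    return np.degrees(np.arcsin(hi) - np.arcsin(lo))

def max_steering_deg(f_hz, d_m):
    """Largest steering angle avoiding grating-lobe ambiguity, eq. (23)."""
    lam = C / f_hz
    return np.degrees(np.arcsin(min(1.0, lam / (2 * d_m))))

# The 64-element, 1.5 mm array: resolution improves from 300 to 500 kHz,
# while the unambiguous sector shrinks once d exceeds lambda/2.
for f in (300e3, 500e3, 750e3):
    print(f / 1e3, beamwidth_3db_deg(f, 64, 1.5e-3, 30.0),
          max_steering_deg(f, 1.5e-3))
```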

Fig. 5. Beam power patterns of a 64-element underwater array with 1.5 mm spacing and unitary weights, unless otherwise specified. (a) Frequency = 300 kHz, steering = 30°, Dolph-Chebyshev weights. (b) Frequency = 750 kHz, steering = 30°. (c) Frequency = 500 kHz, steering = 30°, focal distance r_F = 1 m, and actual distance r_s = 0.5 m. (d) Wide-band pulse composed of 3.5 cycles of a 750 kHz carrier, steering = 30°.

To achieve a larger (theoretically infinite) depth of field, the dynamic focusing technique has been devised, which relies on tuning the focusing distance in synchrony with the arrival times of echoes from different distances [20], [34], [35]. This means that the focusing distance becomes a function of time; for instance, if the transmitter and the receiving array are centered in the coordinate origin, then r_F(t) = ct/2. As the focusing distance is equal, instant by instant, to the arrival distance, the beam pattern does not suffer from any distortions and the entire imaged volume is in focus. This means that the quadratic term in (7) is compensated for, whatever objects may be in the scene; therefore, the imaging system performs like a system working in the far field, for which the Fraunhofer approximation in (9) is still valid. To obtain (21), we hypothesized that the beam pattern profile is constant over the signal bandwidth, meaning that the acoustic pulse contains many carrier cycles and its spectrum is a narrow shape around the carrier frequency. For several reasons, in some real applications acoustic pulses are adopted that are characterized by a large bandwidth and for which the above hypothesis is no longer valid. Beamforming is inherently a wide-band approach [8], so it can be used in such applications. From a mathematical point of view, each frequency bin of the signal is weighted by a specific beam pattern [see (18)], and the beam pattern itself should be taken into account during the inverse Fourier transform in proceeding from (18) to (21). As a result, the final beam signal in the time domain is obtained, but the specific beam pattern contributions are lost. Therefore, the wide-band beam pattern needs a special definition, and a few options have been suggested [36], [37]. In imaging applications, the definition of the beam pattern as the maximum amplitude of the beam signal when the pulse comes from a given arrival direction has been extensively adopted [36]-[39]. Following this definition, the beam pattern of the above-mentioned array has been computed by applying the quasi-closed form proposed in [37], and is shown in Fig. 5(d). A rectangular pulse with a carrier frequency of 750 kHz and lasting for 3.5 cycles has been assumed.
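The maximum-amplitude definition can also be evaluated by direct simulation rather than by the quasi-closed form of [37]. The sketch below propagates a 3.5-cycle, 750 kHz rectangular pulse across the array as a far-field plane wave for each arrival angle, delay-and-sum steers at 30°, and records the peak of the beam signal; it is a simplified illustration under these assumptions:

```python
import numpy as np

C, FS = 1500.0, 20e6
M, D, F0 = 64, 1.5e-3, 750e3
STEER = np.radians(30.0)

x = (np.arange(M) - (M - 1) / 2) * D
t = np.arange(int(120e-6 * FS)) / FS          # 120 us observation window

def pulse(tt):
    """Rectangular pulse: 3.5 cycles of the carrier, zero elsewhere."""
    dur = 3.5 / F0
    return np.where((tt >= 0) & (tt < dur), np.sin(2 * np.pi * F0 * tt), 0.0)

def wideband_bp(theta):
    # per-element arrival delays minus steering delays (far field)
    tau = x * (np.sin(theta) - np.sin(STEER)) / C
    beam = sum(pulse(t - 60e-6 + tau_m) for tau_m in tau) / M
    return np.abs(beam).max()

angles = np.radians(np.linspace(-90, 90, 361))
bp_db = 20 * np.log10([wideband_bp(a) for a in angles])
bp_db -= bp_db.max()
# bp_db reproduces the behavior of Fig. 5(d): the grating lobe near 55 deg
# collapses toward the sidelobe level, while the main lobe stays at 30 deg.
```

The mechanism is visible in the delays: at a grating lobe the per-element shifts equal whole carrier periods, and with only 3.5 cycles the replicas no longer overlap completely.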

This beam pattern has been obtained under the same conditions as those shown in Fig. 5(b), but with a larger bandwidth: the comparison shows that the wide-band beam pattern follows the envelope of the narrow-band one, except around 55°: at this value, the grating lobe has been reduced and has become similar to a sidelobe. The grating-lobe reduction [20], [37] is one of the main reasons for using wide-band signals. These effects are due to the fact that, in narrow-band beam patterns, the positions of the side and grating lobes shift when the frequency varies, whereas the main lobe's position is kept fixed. In addition, the wide-band beam pattern is a sort of overlapping of the narrow-band beam patterns computed at the frequencies contained in the signal band. The beam patterns have been developed and discussed for a plane space, taking into consideration a linear array, although this configuration does not allow one to obtain 3-D images but only a section of a scene profile [see Fig. 2(b)]. To steer the beam inside a 3-D space, two steering angles should be used and a planar array is mandatory. The extension of the theory to the 3-D case is conceptually straightforward but heavy in terms of notation and visualization; therefore, we touch on this issue only in passing. Fig. 6(a) shows the geometry and notation for a planar array, where \alpha and \beta are the azimuth and elevation arrival angles, respectively, and \alpha_0 and \beta_0 are the azimuth and elevation steering angles, respectively. To compute the beam signal in the (\alpha_0, \beta_0) direction, one should sum the signals received by the array elements, as in (13) and (14), after delaying each of them by

\tau_{mn} \approx \frac{1}{c}\left[\frac{x_m^2 + y_n^2}{2 r_F} - x_m \sin\alpha_0 - y_n \sin\beta_0\right]    (25)

where x_m and y_n are the coordinates of the mn-th array element and the \approx sign means that the Fresnel approximation has been applied. In the particular case of square arrays centered in the coordinate origin, regularly spaced, with unitary weight coefficients, and according to the conventions of Fig. 6(a), one can verify that the resulting beam pattern factorizes as

BP_{2D}(\alpha, \beta, \alpha_0, \beta_0, \omega) = BP(\alpha, \alpha_0, \omega)\,BP(\beta, \beta_0, \omega)    (26)

where BP is computed by (20), fixing M equal to the number of elements making up a side of the square array. Fig. 6(b) shows the beam pattern for an array composed of elements spaced \lambda/2 apart, when the steering angles are 0° in azimuth and -30° in elevation. When the angular extent of the scene to be imaged is much larger than that of the validity region of the Fresnel approximation [26], notable errors are incurred in the beamforming operation; they result in distortions of the beam pattern, as pointed out in [40]. If such distortions are not acceptable, the beamforming should be performed using exact delays, as in (12). The consequences of adopting exact delays, in terms of implementation scheme and computational load, are analyzed in Section IV.

B. From Beam Signals to Resolution Cells

Before proceeding with the analysis of lens- and holography-based systems, it may be useful to clarify how beam signals can be exploited to generate a 3-D image. To this end, it is necessary to define the range resolution and to recall the angular resolution. Inside the wide interval called depth of field, the range resolution is defined as the minimum distance between two equal scatterers (placed in the same beam direction) that is needed to resolve their responses. The range resolution [14], [41] is typically inversely proportional to the bandwidth of the emitted pulse.
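Two back-of-the-envelope quantities follow from this: the range resolution from the pulse bandwidth, and the number of beams needed to tile an angular sector with the -3 dB beamwidth of (22). The sketch below assumes the standard matched-filter rule of thumb c/(2B) for the range resolution and reuses beamwidth_3db_deg from the earlier snippet:

```python
import numpy as np

C = 1500.0

def range_resolution_m(bandwidth_hz):
    """Matched-filter range resolution, c/(2B) (assumed rule of thumb)."""
    return C / (2.0 * bandwidth_hz)

def beams_to_tile(sector_deg, f_hz, m, d_m):
    """Beam count needed so adjacent footprints leave no angular holes."""
    width = beamwidth_3db_deg(f_hz, m, d_m)   # from the earlier snippet
    return int(np.ceil(sector_deg / width))

# Example: a 3.5-cycle 750 kHz pulse has a bandwidth around 214 kHz,
# giving a range resolution of about 3.5 mm.
print(range_resolution_m(750e3 / 3.5))
```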
Analogously, the angular resolution is the minimum angular spacing that allows two equal scatterers, placed at the same distance from the array center, to be resolved. As previously said, it depends on the wavelength, the array dimensions, and, though slightly, the steering angle. A resolution cell can be defined as the volume (bounded by the lateral and range resolutions) inside which it is not possible to separate scatterer contributions [14]. The dimensions of the resolution cells are not constant over the volume of interest, as depicted in Fig. 7, where, for simplicity, a 2-D situation is presented. To take a 3-D image of the volume of interest, it is important to arrange a grid of resolution cells, adjacent or partially overlapped, to cover the whole volume without leaving holes (see Fig. 7). To accomplish this, the number of beam signals to be computed and their angular spacing should be carefully planned, and the sampling frequency of each beam signal should be in agreement with the range resolution [29]. The acoustic response of the scatterers contained inside a cell is a function of the reflectivity and of the relative position of each scatterer. The interference among the signals reflected by the scatterers provides the overall reflectivity of the resolution cell; the reflectivity value can be derived from the amplitude of the beam signal over the time interval related to such a cell. In general, the envelope or the squared envelope (i.e., the intensity) of the beam signal is considered, and a time sample of it is assigned to the resolution cell. A matched filtering of the beam signal with the emitted pulse (called pulse compression) is often performed before extracting the envelope or the intensity. Resolution cells span the whole volume to be imaged, whereas scatterers (defined in Section II) are mainly placed on object surfaces. Disregarding water reverberation, if a given cell does not contain any object, its acoustic response is null. As the output of a 3-D acoustic system, one can obtain the whole information (organized in different ways, depending on the specific system) related to the grid of resolution cells covering the volume of interest. For each cell, the position of its center (in Cartesian or polar coordinates) and the related acoustic amplitude or intensity should be provided. The dimensions of each cell can be derived from the knowledge of the system's angular and range resolutions. As a result, the volume of interest can be partitioned into a dense 3-D lattice of cells in which the acoustic response of the scene inside each cell is known.
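A minimal sketch of the per-beam processing chain just described (pulse compression, envelope extraction, one amplitude per range cell) could look as follows; the function and variable names are illustrative:

```python
import numpy as np
from scipy.signal import hilbert, fftconvolve

def cell_amplitudes(beam_signal, emitted_pulse, fs, cell_size_m, c=1500.0):
    """One reflectivity amplitude per range-resolution cell of a beam."""
    # matched filter: correlate with the time-reversed emitted pulse
    compressed = fftconvolve(beam_signal, emitted_pulse[::-1], mode="same")
    envelope = np.abs(hilbert(compressed))        # magnitude of analytic signal
    # two-way travel: a cell of size dr spans 2*dr/c seconds of signal
    samples_per_cell = max(1, int(round(fs * 2.0 * cell_size_m / c)))
    n_cells = len(envelope) // samples_per_cell
    trimmed = envelope[: n_cells * samples_per_cell]
    # assign the peak envelope value within each cell's time interval
    return trimmed.reshape(n_cells, samples_per_cell).max(axis=1)
```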

Fig. 6. (a) Notation and geometry of a 2-D array. (b) Beam power pattern of an underwater array composed of elements spaced \lambda/2 apart, when the steering angles are 0° (azimuth) and -30° (elevation), and the weights are unitary.

Starting from this collection of cells, one can organize the effective information in more compact ways, discarding useless cells or directly extracting object surfaces, as will be discussed in Sections VII and VIII.

Fig. 7. Sketch of a 2-D imaging system: the area of interest is entirely covered by a collection of resolution cells of different dimensions, arranged along the beam (steering) directions.

C. The Acoustic Lens

In this section, the functioning of an ideal acoustic lens is described, regardless of the problems of practical realization and efficiency, which will be faced later on. Therefore, the acoustic lens considered here acts exactly as an optical lens, except for the nature of the processed field. Without loss of generality, we consider the case of a double-convex lens [22], [25], with a diameter equal to D, placed on the plane z = 0 and centered in the coordinate origin, as shown in Fig. 8(a). The projection, through the lens, of the acoustic field impinging on the right face adds a quadratic phase shift and limits the field to that received along the aperture of the lens.

The field on the left face of the lens, P_L(x, y, f), can be computed as follows:

P_L(x, y, f) = P(x, y, f)\,e^{-j(\omega/(2cF))(x^2 + y^2)} for \sqrt{x^2 + y^2} \le D/2; P_L(x, y, f) = 0 otherwise    (27)

where F is a characteristic of the lens called the focal length. By applying Green's theorem to the field P_L, one can compute the field that is measured at the distance q from the lens, in the direction \phi [see Fig. 8(a)], as in (28). By applying the Fresnel approximation to the exponent in (28) and by inserting (10) and (27) into (28), one obtains (29). If, as in Section III-A, we make the assumption that the distance of all the scatterers making up the scene is equal to r_F, and if we apply the lens equation [25]

1/r_F + 1/q = 1/F    (30)

we obtain the final result expressed by (31) and (32).

Fig. 8. (a) Notation and geometry of a lens-based imaging system. (b) 2-D beam power pattern of an underwater lens with an aperture of 9.45 cm when the frequency is 500 kHz, for an observation angle \phi = 30°.

As a last hypothesis, we can assume the beam pattern of the lens to have a constant profile over the signal bandwidth [20], [22], i.e., a profile computed at the center angular frequency \bar\omega of the signal band. By this approximation and by applying the inverse Fourier transform, we obtain the expression (33) for the signal received by an ideal point sensor placed behind the lens, in the direction \phi and at the distance q from the lens itself. This result is very similar to that obtained for the beamforming case in (21), but here the delay of the replicas that make up the signal is augmented by a fixed term (to take into account the path from the lens to the sensor), and the definition of the beam pattern is slightly different, as the spatial aperture of the lens is continuous and not discrete. For a fixed center frequency of 500 kHz and an angle \phi = 30°, Fig. 8(b) shows the beam pattern of a lens with a diameter equal to the spatial aperture of the array considered in Section III-A, i.e., 9.45 cm. If one compares this beam pattern with the corresponding one produced by the beamforming [see Fig. 4(c)], one can observe that they are very similar, except for the classical turnover effect introduced by the lens and for a lower sidelobe level far from the main lobe. Therefore, by arranging on a spherical surface [22], [42] of radius q a dense retina of sensors (in accordance with the angular resolution), one can gather signals useful to build a 3-D acoustic image. In particular, the signal received by the sensor placed in the direction \phi shows a replica of the insonification pulse at a time instant dependent on the distance of an object in that direction, whereas the amplitude of such a replica gives information about the object reflectivity. The collection of all the signals allows us to define a dense 3-D lattice of resolution cells and to fill them with the related acoustic responses, exactly as explained in the previous section for the beamforming case.
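The thin-lens relation (30) and the circular-aperture directivity can be sketched as follows. The jinc-shaped pattern used here is the classical result for an unobstructed circular aperture, offered as the continuous-aperture analogue of the lens beam pattern discussed in this section:

```python
import numpy as np
from scipy.special import j1

C = 1500.0

def retina_distance(r_focus_m, focal_length_m):
    """Solve 1/r_F + 1/q = 1/F, eq. (30), for the image-side distance q."""
    return 1.0 / (1.0 / focal_length_m - 1.0 / r_focus_m)

def lens_beam_pattern_db(gamma_rad, d_m, f_hz):
    """|2 J1(u)/u| with u = pi*D*sin(gamma)/lambda, gamma off the look axis."""
    lam = C / f_hz
    u = np.pi * d_m * np.sin(gamma_rad) / lam
    u_safe = np.where(np.abs(u) < 1e-9, 1.0, u)
    bp = np.where(np.abs(u) < 1e-9, 1.0, 2.0 * j1(u_safe) / u_safe)
    return 20 * np.log10(np.abs(bp))

# Example: 9.45 cm lens at 500 kHz; the first sidelobe of a circular
# aperture sits near -17.6 dB, lower than the -13.3 dB of a line array.
g = np.radians(np.linspace(-30, 30, 1201))
print(retina_distance(5.0, 0.5), lens_beam_pattern_db(g, 0.0945, 500e3).max())
```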

Fig. 9. 3-D beam power pattern of an underwater circular lens with an aperture of 2.1 cm, when the frequency is 500 kHz, for an observation direction \alpha_R = 0°, \beta_R = 30°.

Moreover, in the case of wide-band signals (i.e., when the constant-profile approximation no longer holds), no problem prevents the use of the lens and, as described in Section III-A, a new wide-band beam pattern can be computed that is very similar to that of beamforming. Despite the similarity of the beam patterns, the acoustic lens involves some problems that can be easily overcome in systems based on beamforming. First of all, also when the lens is used, the beam pattern degrades [25] when the distances of the scatterers, r_s, are not equal to the focusing distance, r_F, and the depth of field is like that of beamforming. However, the regulation of the focusing distance requires a nontrivial mechanical rearrangement of the lens system [43], involving also the distance between the lens and the sensors. Moreover, the focusing distance should be known and fixed before the emission of the pulse; above all, the dynamic focusing technique cannot be applied. Second, the phase shift performed by the lens [see (27)] is satisfactory only when the Fresnel approximation applies to the data model. If a scene lies outside the validity region, in beamforming systems one can adopt exact (non-approximated) delays, whereas in lens systems it is much more difficult to perform a correct phase shift. However, these problems can be overcome by arranging a specific group of lenses and performing an aspherical correction of their surfaces [25], [8], [43]. Finally, unlike beamforming systems, lens systems do not make it possible to equalize or modify the sidelobes by using ad hoc weight coefficients. Moving from a plane space to a 3-D one, let us denote by \alpha_R and \beta_R the angles that identify a given retina sensor behind the lens, and by \alpha and \beta the angles that indicate the arrival direction, in a way similar to that adopted in Fig. 6(a) for the beamforming case. According to this notation, one can verify that the 3-D beam pattern of a lens is given by (34), where J_1 is the Bessel function of the first kind and order one. As an example, Fig. 9 shows the 3-D beam pattern of a circular lens 2.1 cm in diameter, when \alpha_R is 0° and \beta_R is 30°. This beam pattern can be directly compared with that presented in Fig. 6(b) for a square array whose side length is equal to the lens diameter.

D. The Holographic Approach

In several holographic systems, the acoustic field collected on the plane z = 0 is adequately processed to reconstruct the scene reflectivity on a different plane [8], [19], [21], [44], [45], [18], without need for the Fresnel approximation and under both narrow-band and wide-band conditions. The main limitation of these procedures is that the volume of interest is laterally bounded by the projection of the receiving aperture. In other words, the obtained image cannot show a scene larger than the 2-D array, whereas no constraints are imposed on the range extension of the scene.

This fact can be acceptable in medical and NDE applications or in synthetic-aperture systems, but it is not acceptable in underwater 3-D imaging, where a large angular sector has to be investigated by using a narrow 2-D aperture. Therefore, we start by describing a particular version of a holographic method, called Fourier/Fresnel transformation [8], well suited for underwater imaging under the narrow-band hypothesis. Let us consider a set of M point-like and omnidirectional sensors that make up a receiving array, numbered by the index m, from 0 to M-1. Assuming to work inside the validity region of the Fresnel approximation and restricting the reasoning to the plane y = 0, the first operation is a phase shifting of the received field, i.e., a simple multiplication of P(x_m, f) by a quadratic focusing term:

P'(x_m, f) = P(x_m, f)\,e^{j(\omega/c) x_m^2/(2 r_F)}    (35)

where P(x_m, f) is the received field given by (10) and corresponds to the Fourier transform of the signal received by the array sensor placed in x_m. As in the previous cases, we can hypothesize for now that the distances of all the scatterers, r_s, are equal to the focusing distance r_F, thus obtaining (36). The second operation is a discrete Fourier transform (DFT) along the array baseline, at a fixed frequency value, thus moving from the index m (position) to the index k (spatial frequency):

Q(k, f) = \sum_{m=0}^{M-1} P'(x_m, f)\,e^{-j 2\pi k m/M}    (37)

where Q(k, f) is periodic in k with period M. If the linear array has an interelement spacing equal to d and is centered in the coordinate origin, one can verify the expressions (38) and (39). The phase term outside the summation in (38) can be neglected, as it does not depend on the scene and can be compensated for by a simple multiplication. It is very important to note that the beam pattern of the holographic method is equal to that of the beamforming approach, as defined in (20), through the following relation [29] between the spatial frequency index k and the steering angle \theta_0:

\sin\theta_0 = k\lambda/(M d)    (40)

Under very narrow-band conditions, we can consider the signal spectrum as a very thin shape around the center angular frequency \bar\omega. As a consequence, (40) becomes a fixed, frequency-independent mapping computed at \bar\omega. By this approximation and by applying the inverse Fourier transform, one obtains the time-domain signal (41). A comparison of (41) with (21), together with the beam-pattern equivalence due to relation (40), shows that this signal and the beam signal are perfectly equal, despite the differences in the procedures used to obtain them. Therefore, all the previous considerations about the meaning and the use of the beam signals can be directly applied to this specific holographic case, as well as all the considerations about: 1) the sidelobe reduction by the exploitation of nonunitary weight coefficients; 2) the limitation on the interelement spacing to avoid the introduction of grating lobes; 3) the effect of an imperfect focusing and the depth-of-field extent; and 4) the possibility of implementing the dynamic-focusing concept [35]. We recall that, for each k, the solution of (40) exists and is unique [29] only when d is equal to \bar\lambda/2, \bar\lambda being the wavelength corresponding to the angular frequency \bar\omega. For d shorter than \bar\lambda/2, the solution does not always exist, whereas for d larger than \bar\lambda/2 (as often occurs), the solution is not unique, i.e., grating lobes are present in the beam pattern. As said for beamforming, in the latter case, if a restricted insonification is performed and, for each k, the solution of (40) that has the minimum absolute value is chosen, then it is possible to generate a complete image of the scene placed in the angular sector given by (23), without any ambiguity.
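The chain (35), (37), (40) is short enough to sketch end to end. The following code applies the quadratic phase shift, takes the spatial FFT along the baseline, and maps the spatial-frequency bins to steering angles; the far-field test case and all names are illustrative:

```python
import numpy as np

C = 1500.0

def holographic_beams(field, d, f0, r_focus):
    """Narrow-band Fourier/Fresnel processing of one monochromatic snapshot.

    field: complex received field P(x_m, f0) at the carrier, one entry
    per array element with spacing d; returns (angles_deg, beam_values).
    """
    m = field.shape[0]
    lam = C / f0
    xs = (np.arange(m) - (m - 1) / 2) * d
    # step 1, eq. (35): compensate the quadratic (focusing) phase term
    focused = field * np.exp(1j * (2 * np.pi / lam) * xs ** 2 / (2 * r_focus))
    # step 2, eq. (37): DFT from element index to spatial-frequency index
    spectrum = np.fft.fftshift(np.fft.fft(focused))
    k = np.arange(m) - m // 2
    # step 3, eq. (40): sin(theta_0) = k * lambda / (M d); keep |sin| <= 1
    s = k * lam / (m * d)
    valid = np.abs(s) <= 1.0
    return np.degrees(np.arcsin(s[valid])), spectrum[valid]

# Example: a plane wave from 20 deg lands in the matching angular bin.
m, d, f0 = 64, 1.5e-3, 500e3
xs = (np.arange(m) - (m - 1) / 2) * d
wave = np.exp(1j * 2 * np.pi * f0 / C * xs * np.sin(np.radians(20.0)))
angles, beams = holographic_beams(wave, d, f0, r_focus=np.inf)
print(angles[np.argmax(np.abs(beams))])   # close to 20 deg
```

Note how the steering angles fall on the fixed grid arcsin(k lambda/(M d)) rather than being freely chosen, which is exactly the property discussed next for the 3-D case.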
Moving from a plane space to a 3-D space, the holographic process can be summarized [8] in the following steps: 1) the signals collected by the sensors of the 2-D array are Fourier transformed; 2) each frequency bin of each signal is phase shifted (only the bins inside a narrow interval around the center frequency are not null); 3) for each frequency bin, a 2-D DFT over the array aperture is performed, moving from the array indexes to a pair of spatial frequencies; 4) for a given pair of values of the spatial frequencies, the inverse Fourier transform results in a time-domain signal that represents the beam signal in a given steering direction. To find the steering direction, it is necessary to solve the 2-D extension of (40), which relates a pair of spatial frequencies to a pair of angles [46]. A sketch of this processing chain is given below.
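The following sketch implements steps 1)-4) with NumPy FFTs under simplifying assumptions: the array geometry, the Fresnel-like focusing phase term, and the synthetic data are illustrative stand-ins, not the paper's exact formulation.

```python
import numpy as np

# Assumed toy setup: an N x N array, T time samples per sensor,
# narrow-band pulse around carrier f0. 'data' would normally hold
# the sampled echoes; here random noise is used as a stand-in.
N, T, fs, f0 = 16, 256, 2e6, 500e3
rng = np.random.default_rng(0)
data = rng.standard_normal((N, N, T))           # sensor signals

# 1) temporal FFT of each sensor signal
S = np.fft.rfft(data, axis=-1)                  # shape (N, N, T//2 + 1)
freqs = np.fft.rfftfreq(T, 1 / fs)

# 2) phase shift each frequency bin (focusing term; here a simplified
#    Fresnel-like quadratic phase for an assumed focusing distance r_f;
#    in a real system only the bins near f0 would be nonzero)
c, r_f, d = 1500.0, 10.0, 1.5e-3                # sound speed, focus, pitch
m = np.arange(N) - (N - 1) / 2
xx, yy = np.meshgrid(m * d, m * d, indexing="ij")
k = 2 * np.pi * freqs / c                       # wavenumber per bin
S *= np.exp(1j * k[None, None, :] * (xx**2 + yy**2)[..., None] / (2 * r_f))

# 3) 2-D spatial DFT over the aperture, for every frequency bin
S_kxky = np.fft.fft2(S, axes=(0, 1))            # spatial frequencies

# 4) inverse temporal FFT: each spatial-frequency pair yields one beam
#    signal, whose steering direction follows from the 2-D extension of (40)
beams = np.fft.irfft(S_kxky, n=T, axis=-1)      # N*N beam signals
print(beams.shape)
```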

Although it is not possible to set a priori the desired steering directions, one can verify that the steering directions linked to the pairs of spatial frequencies are arranged into a well-organized grid, which is very useful in generating a 3-D image. The 3-D beam pattern of this specific holographic process is equivalent to that shown in Fig. 6(b) for the beamforming approach. In addition, the described holographic algorithm coincides with the implementation of a narrow-band focused beamforming in the frequency domain [35], [46]. A wide-band extension of this method can be developed as described in [29], [47]–[49] for the frequency-domain beamforming; the main difficulty lies in the fact that (40) depends on the angular frequency. As all the mentioned 1-D and 2-D Fourier transforms can be performed by the fast Fourier transform (FFT), the described approach is very interesting from a computational point of view. However, if a scene is very far from the validity region of the Fresnel approximation, steps 2) and 3) cannot be performed any more in a separate way, thus preventing the use of the Fourier/Fresnel transformations method.

To avoid the Fresnel approximation and, at the same time, to process wide-band signals, we describe a second holographic approach, based on a matrix formulation, that is used by a few real 3-D systems [50], [51]. To this end, it is necessary to express the data model presented in (5) in a different manner. Let us denote by a column vector the field received by the array sensors, and by another column vector the reflectivity of each resolution cell contained in the scene volume to be imaged. The number of cells is different from the number of scatterers used in the previous data model and, if a cell does not contain any object, its reflectivity may be null. The data model presented in (5) can be rewritten as in (42), where the transfer matrix relates the cell reflectivities to the received field and its generic element is defined as in (43). The imaging process lies in estimating the reflectivity vector, starting from the knowledge of the transfer matrix and from the experimental measure of the received field. As, in a 3-D imaging system, the total number of resolution cells is too large to be handled by a single matrix, one can rewrite (42) in the form (44), in which the cells are arranged on a regular grid over a spherical surface of given radius, the data vector is the Fourier transform of the field received over a brief time interval, and a related transfer matrix is associated with each surface. As a result, the resolution cells of the 3-D volume are organized as a sequence of concentric spherical layers, and a specific transfer matrix is defined for each layer. The best estimate of the reflectivity vector [50], [51], assuming that the number of sensors is smaller than the number of cells, is given by the minimum-norm solution of (44), reported in (45), which involves the complex conjugate transpose of the transfer matrix and its pseudoinverse (the matrix arguments are omitted to simplify the notation). A further discussion of the implementation and performances of this method is provided in the next section.
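A compact way to see (45) at work is the following sketch, which builds a random transfer matrix for a single spherical layer and recovers a sparse reflectivity vector as the minimum-norm solution; the dimensions, the synthetic matrix, and the scatterer positions are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

M, N = 64, 400        # sensors, resolution cells in one layer (M < N)
# Random complex matrix as a stand-in for the transfer matrix of (43).
H = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))

x_true = np.zeros(N, dtype=complex)
x_true[[50, 200, 310]] = [1.0, 0.5j, 0.8]        # three reflecting cells

v = H @ x_true                                    # received field, as in (44)

# Minimum-norm estimate (45): x_hat = H^H (H H^H)^{-1} v,
# i.e., the Moore-Penrose pseudoinverse applied to the data.
x_hat = H.conj().T @ np.linalg.solve(H @ H.conj().T, v)

print(np.sort(np.argsort(np.abs(x_hat))[-3:]))    # strongest estimated cells
```

With such a heavily underdetermined system the reconstruction is smooth rather than exact, but the strongest cells of the estimate typically coincide with the true scatterers.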
IV. CHALLENGES IN SOLVING SHARED PROBLEMS

Unfortunately, there are several problems that slow down the progression from image-generation concepts to real systems characterized by an acceptable tradeoff between performances and costs. In this section, we try to face these problems (often shared by the three methods of image generation) and analyze some potential solutions that have been proposed. We aim not to review all the problems and solutions in an exhaustive way, but simply to focus on those problems that we consider particularly crucial and on the most promising solutions. We refer the reader to the cited literature for complete details of the mentioned methods.

Two critical issues (linked to each other) in the development of 3-D acoustic imaging systems are the computational load and the hardware cost. The current trend is to design systems in which each received signal is immediately time sampled and amplitude digitized; then, the whole processing and rendering are performed by a digital architecture controlled by specific software. In this sense, an acoustic lens completely avoids the computational load, as the spatial processing of the backscattered echoes is physically performed, but it involves high hardware costs, due to the need for the lens itself and for a dense retina of sensors (one sensor is needed for each signal, i.e., for each beam signal). To build an understandable 3-D map, at least 64 × 64 beam signals seem to be necessary, thus requiring 4096 sensors and acquisition channels. More correctly, the number of necessary beams depends on the ratio of the angular extent of the scene to be imaged to the angular resolution of the imaging system. Therefore, the larger the spatial aperture, the larger the number of beams necessary to cover the scene. As each beam signal is an output giving the acoustic response coming from a single direction, it is sufficient to acquire its envelope (or intensity) by using a sampling frequency consistent with the range resolution. For instance, a sampling frequency of 75 kHz is ideally sufficient to match a range resolution of 1 cm, thus reducing the cost of the acquisition channels. In beamforming systems, the correspondence between beams and sensors is not one-to-one, as it is for a lens, but it is easy to verify that, in equispaced 2-D arrays, to avoid spatial undersampling, the number of sensors is only a little smaller than the number of beams [29]. Moreover, an additional cost is due both to the fact that one must sample the echo waveform instead of the envelope and to the fact that the sampling operation must be simultaneous in all the channels [29]. To reduce the sampling frequency and the number of stored samples, the quadrature reception scheme is often adopted [29], [52], [53]; it requires a sampling frequency equal to the signal bandwidth and not dependent on the carrier frequency. The related quadrature beamforming involves both a time shift and a phase shift of each signal before the summation [29], [54], as sketched below.
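The following sketch shows, under simplifying assumptions (narrow-band pulse, far-field steering of a linear array), how quadrature beamforming applies a coarse time shift on the baseband samples plus a fine phase rotation at the carrier frequency; all names, parameters, and sign conventions are illustrative only.

```python
import numpy as np

def quadrature_beamform(iq, fs_bb, f0, delays):
    """Delay-and-phase beamforming on baseband (I/Q) data.

    iq     : (n_sensors, n_samples) complex baseband signals
    fs_bb  : baseband sampling frequency (~ signal bandwidth) [Hz]
    f0     : carrier frequency [Hz]
    delays : (n_sensors,) steering delays [s]
    """
    n_sens, n_samp = iq.shape
    out = np.zeros(n_samp, dtype=complex)
    for m in range(n_sens):
        shift = int(round(delays[m] * fs_bb))    # coarse time shift (samples)
        rolled = np.roll(iq[m], shift)
        # fine correction: phase rotation at the carrier frequency
        out += rolled * np.exp(2j * np.pi * f0 * delays[m])
    return out / n_sens

# Toy usage: 32-element linear array, half-wavelength pitch, steered at 20 deg.
c, f0, fs_bb = 1500.0, 500e3, 60e3
n = 32
d = c / f0 / 2
delays = np.arange(n) * d * np.sin(np.deg2rad(20)) / c
iq = np.ones((n, 512), dtype=complex)            # stand-in for received I/Q data
beam = quadrature_beamform(iq, fs_bb, f0, delays)
```

Note that the baseband rate fs_bb only needs to match the signal bandwidth, which is the cost saving the text refers to.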

More generally, due to the huge numbers of acquired signals and of beams that have to be computed, the computational load of beamforming is often considered prohibitive for 3-D imaging, although many different implementations have been proposed [29], [54], [48], [49]. Basically, these implementations rely on the time domain or on the frequency domain (thus exploiting the FFT advantages). In the following, we list a selection of frequently encountered problems.

1) The frequency-domain beamforming can be used only when the Fresnel approximation is acceptable [40]; moreover, it reduces the freedom of choosing the steering directions.
2) When the bandwidth of the signal is wide, the frequency-domain beamforming may lose its computational convenience [29], [48].
3) The time-domain beamforming (also in the quadrature case) requires the use of a sampling frequency that is a few times higher than the Nyquist one, or the use of an equivalent degree of interpolation [29].
4) The computation of the exact delays that must be used when the Fresnel approximation is not acceptable is heavy and should be performed offline. Moreover, when dynamic focusing is applied, the amount of memory needed to store the delays is quite large.

Concerning holographic systems, although the computational load of the Fourier/Fresnel transformations method is low, in many practical cases the assumption of a very narrow band and the unavoidable application of the Fresnel approximation are not acceptable. The matrix approach overcomes these two problems, but it requires the offline computation and the storage of an inversion matrix for each layer and for each frequency bin considered. If, for instance, we assume an array of some thousands of sensors and layers of some thousands of resolution cells, the matrix contains more than 10 million elements. Therefore, even when only one frequency bin and a few tens of layers (i.e., discrete range values) are used, the amount of memory needed is quite large. Moreover, the computation of the cell reflectivities should be performed by a hardware architecture well suited to matrix algebra.

After fixing the desired angular resolution, a reduction in the number of array elements is an effective way to lower both the hardware cost and the computational load. In beamforming and holographic systems, there are two ways of achieving a sharp reduction in elements without introducing any grating lobe into the beam pattern: synthesizing a nonequispaced sparse array, or adopting an equispaced sparse array working with very wide-band signals. If the former procedure is used, although the half-wavelength spacing condition is not fulfilled, grating lobes are avoided because there are no periodicities in the positions of the sparse-array elements. The main drawback is an increase in the level of the sidelobes; therefore, it is essential to minimize the number of elements and to optimize their weight coefficients, while maintaining adequate beam-pattern characteristics. Optimum methods have been proposed [55], [56] that are able to accomplish the former procedure by starting from dense (i.e., half-wavelength-spaced) arrays that have fewer than about 100 elements. To face a larger array, a stochastic method based on the simulated-annealing algorithm has been proposed by the authors [57], [58]: it is able to find solutions that are very close to the optimum ones (a sketch of this strategy is given below). For instance, the 3228 elements of a dense array can be reduced to 359 active elements, while the sidelobe peak does not exceed -21.2 dB.
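The following is a minimal sketch of sparse-array synthesis by simulated annealing, in the spirit of [57], [58] but not the authors' actual algorithm: it randomly relocates active elements of a half-wavelength-spaced linear array and accepts moves that keep the peak sidelobe level low. The array size, cooling schedule, and main-lobe exclusion are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Dense lambda/2 linear array; keep only n_active elements while
# keeping the peak sidelobe level (PSL) of the beam pattern low.
N, n_active = 100, 40
u = np.linspace(-1, 1, 2001)                   # u = sin(theta)
pos = np.arange(N) * 0.5                       # positions in wavelengths
steer = np.exp(2j * np.pi * np.outer(u, pos))  # array manifold

def psl_db(active):
    bp = np.abs(steer[:, active].sum(axis=1))
    bp /= bp.max()
    main = np.abs(u) < 2.0 / (N * 0.5)         # crude main-lobe exclusion
    return 20 * np.log10(bp[~main].max())

active = np.sort(rng.choice(N, n_active, replace=False))
cost, T = psl_db(active), 3.0
for it in range(5000):
    cand = active.copy()
    cand[rng.integers(n_active)] = rng.integers(N)   # move one element
    if len(np.unique(cand)) < n_active:
        continue                                     # collision: skip move
    c = psl_db(cand)
    # Metropolis rule: always accept improvements, sometimes accept worse.
    if c < cost or rng.random() < np.exp((cost - c) / T):
        active, cost = np.sort(cand), c
    T *= 0.999                                       # geometric cooling

print(f"peak sidelobe after annealing: {cost:.1f} dB")
```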
If the latter procedure is used, as shown in Fig. 5(d), the widening of the signal bandwidth reduces the grating-lobe level to an acceptable value, thus allowing the use of an equispaced array that does not satisfy the half-wavelength spacing condition. The grating-lobe reduction does not depend on the element spacing (this affects only the distance between the main lobe and the grating lobes) and can be predicted on the basis of both the pulse band and the pulse waveform, as the authors described in [37] and [59]. Therefore, the adoption of wide-band signals is an effective way of designing very sparse arrays with acceptable grating- and sidelobe levels [20], [38], provided that the number of elements is sufficient to ensure the desired signal-to-noise ratio (the latter depends on the number of elements through the array gain [10], [29]).

From the above discussion, one can deduce that the acceptance of the Fresnel approximation allows the use of some imaging algorithms that are convenient from a computational point of view. In order to widen the validity region of such an approximation without degrading its precision, one of the authors proposed to weight the last two terms on the right-hand side of (7) with two coefficients computed by a least-squares procedure. The weighted Fresnel approximation has been tested on both linear [40] and 2-D [60] arrays, achieving an increase of 1.5 times in the angular aperture of the validity region. After minor modifications, this new approximation can be adopted by all the algorithms that have previously used the Fresnel one.

In the acoustic-imaging field, a serious problem is the presence of sidelobes in the beam pattern, as it results in sharp blurring effects inside the computed image. In the previous section, we remarked that weight coefficients (also referred to as window functions) allow one to reduce the sidelobe level, though at the cost of a loss in resolution. (Weight windows can be applied in both beamforming and holographic systems, but not in lens-based systems, where the sidelobe level cannot be easily regulated.) Therefore, thanks to the selection of adequate weights, one can fix the final tradeoff between resolution and blurring effects on the basis of the application considered [14], [29]–[31]; a narrow-band illustration of this tradeoff is sketched below. However, this reasoning assumes the bandwidth of the signals to be very narrow: weight windows yield results different from the expected ones if applied to wide-band signals. For such signals, the effect of a weight window depends on both the bandwidth and the envelope of the acoustic pulse [59], [37], [36], [61]. Unfortunately, it has been reported that traditional windows often worsen the beam-pattern profile. So far, the synthesis of weight coefficients well suited to work under realistic wide-band conditions has received little attention in the literature [38], [61]–[63], and only a few attempts have been made.
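To illustrate the narrow-band tradeoff between sidelobe level and main-lobe width, the following sketch compares the beam pattern of a uniformly weighted linear array with a Hamming-weighted one; the 64-element, half-wavelength geometry is an assumption of the example.

```python
import numpy as np

N = 64
pos = (np.arange(N) - (N - 1) / 2) * 0.5     # lambda/2 positions [wavelengths]
u = np.linspace(-1, 1, 4001)                 # u = sin(theta)
A = np.exp(2j * np.pi * np.outer(u, pos))    # narrow-band array manifold

for name, w in [("uniform", np.ones(N)), ("Hamming", np.hamming(N))]:
    bp = np.abs(A @ w)
    bp_db = 20 * np.log10(bp / bp.max() + 1e-12)
    width = np.ptp(u[bp_db > -3.0])          # -3 dB main-lobe width (in u)
    half = bp_db[u >= 0]
    # first local minimum = first null; peak sidelobe lies beyond it
    mins = np.where((half[1:-1] < half[:-2]) & (half[1:-1] < half[2:]))[0]
    psl = half[mins[0] + 1:].max()
    print(f"{name:8s}  -3 dB width: {width:.4f}   peak sidelobe: {psl:.1f} dB")
```

Under these assumptions the Hamming window buys roughly 30 dB of sidelobe suppression at the price of a main lobe about 1.5 times wider, which is exactly the narrow-band behavior discussed above.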

Fig. 10. Wide-band beam power pattern of a 125-element underwater array with λ/2 spacing and unitary weights (dotted line) or optimized weights (solid line). The adopted pulse has a Gaussian-shaped envelope lasting for two carrier cycles.

Nevertheless, this topic is of great practical importance, as a wide-band beam pattern (unlike a narrow-band one) does not have a constant energy, thus allowing a sidelobe equalization with a very reduced loss of angular resolution, as shown in Fig. 10. This figure compares the beam patterns of a 125-element array when a Gaussian-shaped pulse with a duration equal to two carrier cycles is used: the beam patterns were obtained by applying and by not applying a set of optimized weight coefficients. Such optimized weights were computed by a wide-band extension [63] of the simulated-annealing method of synthesis devised by the authors and described in [57], [58], and [64]–[66].

Adaptive beamforming is a technique that allows one to reduce the sidelobe level and, at the same time, to narrow the main-lobe width. It is commonly used in radar and sonar systems to suppress unwanted interferences [28], but it is rarely exploited in imaging systems. In [67], adaptive beamforming has been applied to a 3-D acoustic system devoted to zooplankton imaging, to increase the dynamic range and improve the resolution and accuracy. Adaptive beamforming techniques involve the computation of weight coefficients (following different approaches) on the basis of the received signals (data-dependent beamforming), often exploiting the input covariance matrix. In particular, minimum-variance beamforming determines the weights by minimizing the beam-signal power under the constraint that the amplitude response in the steering direction be kept constant [28], [67]; see the sketch below. Assuming that the number of scatterers producing echoes is smaller than the number of sensors, and that such echoes are not correlated with one another, the theory shows that the obtained beam pattern is much better than that of conventional beamforming. However, these two conditions are fulfilled only in some specific applications, like the monitoring of plankton organisms; thus, the great advantages of adaptive beamforming are not always achievable.
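The following sketch computes minimum-variance (Capon) weights from a sample covariance matrix for a narrow-band linear array; the scenario (a single strong interferer, white noise, diagonal loading) is an assumption of the example rather than the setup of [67].

```python
import numpy as np

rng = np.random.default_rng(3)

N, snapshots = 32, 2000
pos = np.arange(N) * 0.5                         # lambda/2 positions

def steering(u):                                 # u = sin(theta)
    return np.exp(2j * np.pi * pos * u) / np.sqrt(N)

a_look = steering(0.0)                           # look direction: broadside
a_int = steering(np.sin(np.deg2rad(25)))         # interferer at 25 deg

# Synthetic snapshots: strong interferer plus white noise.
sig = 10 * (rng.standard_normal(snapshots) + 1j * rng.standard_normal(snapshots))
noise = (rng.standard_normal((N, snapshots))
         + 1j * rng.standard_normal((N, snapshots))) / np.sqrt(2)
X = np.outer(a_int, sig) + noise

R = X @ X.conj().T / snapshots                   # sample covariance matrix
R += 1e-3 * np.trace(R).real / N * np.eye(N)     # diagonal loading

# MVDR weights: w = R^-1 a / (a^H R^-1 a), unit gain in the look direction.
Ria = np.linalg.solve(R, a_look)
w = Ria / (a_look.conj() @ Ria)

print(f"gain toward look dir  : {abs(w.conj() @ a_look):.3f}")   # ~1
print(f"gain toward interferer: {abs(w.conj() @ a_int):.2e}")    # deep null
```

The deep null placed on the interferer, together with the unit-gain constraint in the look direction, is the behavior that makes the data-dependent weights superior to fixed windows when the stated conditions hold.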
In the previous section, the beam pattern was regarded as a function describing the ability of the system to spatially discriminate among signals coming from different directions and distances. However, if the array elements emit instead of receiving, the beam pattern represents the ability of the system to focus the acoustic energy in a given direction and at a given distance. Potentially, an acoustic system can be composed of two different arrays for the transmission and reception of signals, or a single array can be used for both operations by using monostatic transducers. In these cases, the comprehensive beam pattern is given by the product of the transmission beam pattern by the reception one [24], [62]. When the transmission and reception beam patterns are perfectly equal, the comprehensive beam pattern has a narrower main lobe (producing an improvement in resolution) and reduced sidelobes. Moreover, if two sparse arrays are used to transmit and receive, respectively, it is possible to tune the element spacings in such a way that the transmission grating lobes do not overlap with those of the reception array, so that the comprehensive beam pattern shows sidelobes only [62], [68]. However, due to the directivity of the transmission, to collect information about a space region, it is necessary to repeat the insonification process many times, following a scanning scheme.

In 2-D medical imaging, where distances are very short and just one plane is examined, successive insonifications do not prevent one from collecting many images per second. In 3-D underwater imaging, the emission of thousands of beams (e.g., a 64 × 64 grid) to image a scene that is 100 m distant would take about 10 min; the simple arithmetic behind this figure is sketched below. Therefore, despite the importance of the achievable results, these methods should be discarded for real systems that require high frame rates [69], [70]. A promising way of overcoming this problem is to transmit many beams at the same time, by using narrow-band pulses at different frequencies or wide-band, specially coded waveforms [41], [69], [70]. To the best of the authors' knowledge, no real 3-D underwater systems implementing these techniques have been experimented with, and only a design activity has been described in [70], considering the use of an acoustic lens in which the transducers of the retina are utilized both to transmit and to receive.
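The scanning-time figures quoted here and in Section V follow directly from the round-trip propagation time; the short computation below, with an assumed 64 × 64 beam grid, reproduces their order of magnitude.

```python
C = 1500.0                      # sound speed in water [m/s]

def scan_time(n_emissions, max_range_m):
    """Total acquisition time when emissions are sequential:
    each one must wait for the round trip to the maximum range."""
    round_trip = 2 * max_range_m / C
    return n_emissions * round_trip

print(scan_time(64 * 64, 100))  # one emission per beam: ~546 s (~10 min)
print(scan_time(64, 100))       # Mill's Cross, one per slice: ~8.5 s
```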

A major problem affecting all coherent imaging systems (like acoustic ones) is speckle noise. This term indicates the interference, at the detector, of the signals backscattered by the small-scale topographic relief of an object's surface (i.e., its roughness) [17]. The random difference introduced by speckle among the amplitudes of the beams steered toward a given surface [8], [14], [71] is a sharp source of noise that breaks boundaries and causes large targets to appear as multiple smaller ones, i.e., affected by granular patterns. Speckle reduction is a deeply discussed issue, for which various methods have been proposed that aim to generate images with less noise or to reduce speckle by a postprocessing operation. In the first case, as the speckle patterns due to different frequencies are uncorrelated, one can reduce speckle noise if several images obtained by using narrow-band signals at different frequencies are acquired and then merged [72], [73]. This produces an averaging effect that tends to smooth the appearance of a surface. An analogous result can be achieved by adequately exploiting the different frequencies contained in the spectrum of a wide-band pulse. Another way of obtaining images of the same scene with uncorrelated speckle patterns is to move the observation point (i.e., spatial diversity); but, in this case, the merging operation requires the solution of the spatial correspondence problem [72], [73]. However, these two techniques require that various images of the same scene be merged, thus inhibiting real-time functioning; in addition, the bandwidth of realistic pulses is not sufficient to abate speckle completely. As a consequence, speckle reduction is one of the main postprocessing purposes; it has been faced by adaptive filtering [71], [74], by the exploitation of the statistics of the components of the noisy signal [75], and by probabilistic techniques based on a priori information. These approaches will be described in the next sections.

An additional problem shared by the three image-generation concepts described in the previous section lies in the specular reflections produced by many real objects. Surfaces of man-made objects (often the most interesting in 3-D imaging) tend to be quite smooth, whereas those of natural objects tend to be rough. Due to the relatively large wavelength of the acoustic waves used in 3-D underwater imaging (3 mm at 500 kHz), man-made objects show a strong reflection in the specular direction [8], thus resulting in images with a very high contrast. This means that a cylinder is imaged as a line, a sphere as a single point, and a flat surface as a narrow strip [14]; therefore, recognition becomes a very hard task. Portraying a scene composed of objects whose surfaces are rough with respect to the wavelength would be the optimal solution to the problem, but this is often unrealistic. Other solutions rely on the a priori modeling of the responses of a smooth surface and on the consequent model-based surface reconstruction [76]–[78], as the next sections will briefly consider. Moreover, the progressive incrustation process from which man-made objects typically suffer when immersed provides an additional roughness that often reduces this problem.

Finally, the spatial and temporal coherences of the medium affect both the beam pattern and the signal-to-noise ratio of the imaging system. These phenomena are currently under analysis [79]–[81] with reference to long-range sonar systems using frequencies from 100 Hz to a few kilohertz and distances longer than 1 km. As the spatial distances and the time intervals involved in the sonar systems addressed in this paper are quite short, the spatial and temporal coherences are probably of minor importance as compared with the previously faced problems. However, to the best of the authors' knowledge, it is not possible to draw conclusions on this point, because the effects of coherence on systems using high frequencies have not been sufficiently investigated yet.

V. REAL SYSTEMS AND RELATED ADVANTAGES

In this section, we briefly describe some interesting 3-D underwater imaging systems reported in the literature in recent years. We focus on systems that have been built and that are available as commercial products or as prototypes. To the best of the authors' knowledge, the only off-the-shelf 3-D acoustic camera available is the EchoScope 1600, designed and produced by OmniTech in Norway [51], [82], [6]. The number of acoustic sensors is 1600, making up a 40 × 40 array with a 19.5-cm side, able to work at three different frequencies: 150, 300, and 600 kHz. The received echoes are processed by a matrix holographic approach, like that defined in (44), which avoids the Fresnel approximation.
Moreover, the value of the angular frequency is fixed; thus, the Fourier transform is not necessary, as the amplitude and phase data of a received signal are obtained by quadrature reception [29]. For each spherical surface of a given radius, a grid of partially overlapped resolution cells is determined, and the best estimate of the related reflectivity vector is obtained by means of (45), where the spectral theorem is exploited to decompose the matrix to be inverted, as described in [51]. As the intersensor spacing is fixed, to avoid grating-lobe effects, the viewing and insonification angles are bounded (the bound becomes tighter as the frequency grows), and the lateral resolutions (measured at -3 dB, with unitary weights) are 2.5°, 1.3°, and 0.6° at 150, 300, and 600 kHz, respectively. The allowed detection range spans from 1 to 100 m and, thanks to the powerful computation architecture, it is possible to achieve, on average, five 3-D images per second with a range resolution of 5 cm. Owing to the above-mentioned features, the EchoScope 1600 represents a powerful way of obtaining real-time 3-D information over a very large range of distances, thus allowing one to choose the best tradeoff among resolution, viewing angle, and sidelobe level (this level can be tuned by using different weighting windows). The information acquired by a single acoustic pulse (taking into account only the points that exceed a given threshold) can be arranged into an artificially illuminated 3-D grid, as shown in Fig. 11 for quite different application contexts. As an alternative to artificial illumination, the intensity of the echo related to each point, or the horizontal distance, can be overlapped with the 3-D grid by using a suitable color palette.

Two interesting systems that use an acoustic lens have been devised and tested at the Draper Laboratory in Cambridge, MA, and at the Naval Research Laboratory in Washington, DC, in cooperation with the University of Washington in Seattle, WA (NRL-UW). Experiments have been carried out inside tanks. In the first laboratory [43], [83], a four-element lens 19 cm in diameter has been designed and fabricated; it has a wide viewing angle and can work at different frequencies, in the range from 500 kHz to 2 MHz. The measured angular resolution is equal to 0.24° at 2 MHz and 0.5° at 1 MHz [43]. The realization of a large, dense retina of hydrophones to be coupled with the lens is under development [83]–[85]. For demonstration purposes, experiments have been carried out by repeating the insonification and mechanically stepping a small array (composed of micromechanical hydrophones) in order to cover the entire area of the desired retina [83]. Although, before testing the real retina, it is not possible to draw any conclusions about this system, the signals received through the lens were of excellent quality and allowed the generation of precise 3-D images, as shown in Fig. 12. In the second laboratory [5], [42], a lens 25 cm in diameter has been filled with liquid; the lens is followed by a hemispherical retina populated with 128 transducers arranged in eight rows of 16 elements each.

Fig. 11. Rendering of 3-D data acquired with the acoustic camera OmniTech EchoScope 1600, with artificial illumination, and for an arbitrary position of the observation point of view. (a) The wall of a lock in Zeeland, the Netherlands; the upper half is concrete and the lower half is corrugated steel. (b) A snapshot of an underwater rig structure 16 m high and 20 m wide. (c) The wreck of the MS Inga (a steel vessel located in Zeeland) lying on its starboard side; detail of the hole in the side of the vessel; on the left, the steel plate has been bent in. (d) A man walking in a 1.5-m-deep pool; apart from the man's feet and legs, one can see the tile pattern of the pool floor. (Data in panels (a) and (c): courtesy of Rijkswaterstaat.)

This system works at 300 kHz over a distance of about 100 m, and the transducers are placed in such a way as to cover a viewing angle of 48° in azimuth and 12° in elevation. As the angular resolution of the lens is about 1.5°, the 128 transducers acquire a 3-D image that is a sparse, undersampled grid. To get a dense data set, the lens should be mounted on a moving platform and made to pass over a scene in order to synthetically obtain a denser retina. This data-collection strategy involves the problem of determining the sensor positions and the difficulties of registering and combining the collected data. Fig. 13 shows 3-D images reconstructed from data collected by the NRL-UW lens-based system moving along straight-line trajectories inside a tank.

Several systems exploit the duality of the beamforming steering for both transmission and reception in order to build a 3-D image by a combined scanning of a scene. In these systems, at each pulse emission, a vertical transmitting array insonifies a narrow slice in elevation, and a horizontal receiving array detects the signals coming from a set of azimuth angles across each slice. This configuration, made up of two linear arrays, is called Mill's Cross; its main drawback is the time taken to scan a whole 3-D scene. To obtain a 64 × 64 grid of points by the Mill's Cross scheme, 64 pulse emissions are required; if the maximum range is fixed at 100 m, 0.13 s is the minimum time between two successive emissions, and 8.5 s is the total scanning time. Therefore, these systems provide an acceptable frame rate only over much shorter distances (e.g., one frame per second is achievable over a distance of about 10 m). Obviously, the main advantage lies in the limited numbers of transducers and related transceiver channels.

Fig. 12. Amplitude and range images acquired by the lens-based system developed at the Draper Laboratory, Cambridge, MA. The scene is a pile of cinder blocks, about 2 m far from the lens. (a) Amplitude image obtained by projecting the value of the peak of each signal received by the retina sensors; dark pixels represent high-amplitude peaks. (b) Range image associated, point by point, with the amplitude one; the gray level of each pixel depends on the range distance related to the peak whose amplitude is given in the amplitude image, and darker pixels represent points closer to the lens.

As the resulting beam pattern is the product of the transmission beam pattern by the reception one, by using only two crossed linear arrays one can obtain the same angular resolution as that given by a dense 2-D receiving array, as can be deduced from (26). Two real systems based on this concept are described in [86], [87], together with their performances. More generally, if the acquisition time is not a problem, it is possible to collect 3-D information by scanning a single beam over the volume of interest [88]. The pencil-beam sonar is a system that works in this way, by mechanically steering a narrow beam both to insonify a given direction and to receive its acoustic response. Another possible approach to gathering 3-D information relies on exploiting the movement of the platform during navigation. The multibeam echosounder is a well-known sonar system going in this direction, commonly used to produce 3-D elevation maps (bathymetry) of the sea bottom [8], [24], [89]. A bathymetric map is useful for many applications, ranging from marine geomorphology, harbor and navigation-channel survey, and offshore mining, to vessel navigation and repositioning. The sonar system is mounted below a ship hull (or below an underwater vehicle) and works only on the vertical plane perpendicular to the motion direction (navigation track). The line representing the intersection of this plane with the sea-bottom surface is the bottom profile that a multibeam echosounder tries to acquire. By putting together a sequence of these profiles taken at different positions along the motion direction, it is possible to arrange a 3-D bathymetric chart. At a given time instant, a pulse insonifying a narrow strip on the sea bottom (i.e., the profile to be acquired) is emitted, and the backscattered echoes are received by a linear array that is perpendicular to the motion direction. Signals are processed by beamforming, so that each beam provides the measure of the bottom distance in its steering direction, and the bottom profile can be derived from a well-organized set of adjacent beams. Fig. 14 shows a bathymetric map acquired by a high-frequency multibeam echosounder, the SeaBat 9001, produced by Reson A/S, Denmark, working at 455 kHz with a 140-m depth range. The resolution and quality of the image are quite similar to those of the 3-D systems discussed in this paper.

Finally, it can be useful to mention two sonar systems that typically provide only 2-D information but are largely employed in actual underwater operations. The first one is the sidescan sonar (also called side-looking sonar) [8], [24], [89], a system devoted to generating an image of the sea-bottom area on the right (or left) of the navigation line, mainly for classification purposes. A pulse (typically called a ping) is emitted that insonifies a narrow strip of the bottom on the right (left) of the platform, perpendicular to the navigation track.
The backscattered signals are acquired by a linear array steered in the same direction; thus, the obtained signal represents the acoustic response coming from such a strip, and each time instant is referred to a given distance. By collecting together many of these signals acquired during the navigation, it is possible to arrange a 2-D image of the sea bottom. For both sidescan and multibeam echosounder systems, the maximum range (many hundreds of meters) and the maximum carrier frequencies (some tens of kilohertz) are quite different from those of the systems previously described, and the aims are dissimilar too. Therefore, this kind of system is considered only marginally in this paper. The second one is the forward-looking sonar, a multibeam system looking in front of the platform [8], [24].

Fig. 13. 3-D images acquired by the lens-based system developed at the Naval Research Laboratory, Washington, DC, and at the University of Washington, Seattle, WA. (a) 3-D rendering of the acoustic image obtained inside a tank when the acoustic lens-based system was moved along a straight path over a 1.6-m-long ROV that was 5 m far from the lens. (b) 3-D rendering obtained by merging four passes, like that shown in panel (a), for different angles between the ROV axis and the lens track. (c) Picture of the imaged ROV.

Fig. 14. A 3-D bathymetric map acquired by using the SeaBat 9001, a multibeam echosounder system produced by Reson A/S, Denmark. A portion of the Bonneville Dam Stilling Basin is shown, where the concrete base and the concrete baffles can be clearly seen; the effects of the concrete erosion are also visible.

By a single pulse emission, an angular sector in front of the platform is insonified that is wide in the horizontal plane and narrow enough in the vertical one. The backscattered echoes are received by a horizontal linear array, mounted perpendicularly to the navigation track, and beamformed over a set of adjacent directions (see Fig. 7 for an example). In this way, by means of a single pulse and without the need for any platform motion, a 2-D image can be arranged that represents the response of the scene in front of the ship; such an image can be very useful in any obstacle-avoidance or object-detection task [90].

VI. THREE-DIMENSIONAL ACOUSTIC IMAGING IN MEDICAL APPLICATIONS

In this section, we present conceptual analogies between acoustic imaging in underwater and in medical applications, focusing attention on 3-D imaging, and we highlight several differences in both the imaging aims and the commonly adopted procedures. We simply point out the major problems and a few potential solutions, referring the reader to the appropriate literature for greater details.

Acoustic images, as generally used in medical echography [20], [27], [91], are 2-D and represent a cross section of a given part of the human body.

The brightness of each pixel is proportional to the acoustic reflectivity of the body tissue in the related resolution cell. A linear array is sufficient, and is typically used to transmit and receive in connection with a beamformer exploited in both directions. Many adjacent beams are emitted in time sequence; for a given beam transmitted by the array, the backscattered signals are gathered by the same array and beamformed with a steering angle equal to that used for the transmission. After the application of appropriate time-varying gains (TVGs), the magnitude of a beam envelope over time depends on the reflectivities of the tissues and organ boundaries that have been met in the steering direction. Due to the short distances involved, the collection of more than 100 beams requires a few tens of milliseconds [92], thus allowing the generation of a cross-sectional image in real time. The double beamforming process yields an improvement in the lateral resolution, decreases the sidelobe level, and increases the signal-to-noise ratio; these three results are very important in order to obtain a useful and understandable representation of the human body.

In underwater imaging, a single pulse is generally used to insonify the whole scene to be imaged, and the beamforming process is applied only to signal reception. This is due to the different imaging aims: to measure the tissue reflectivity profile in the beam direction in medical applications; to measure just the distance and the reflectivity of the closest surface in the beam direction in underwater applications. The need for the double beamforming process increases the difficulty in moving from 2-D imaging (on the order of a hundred beams) toward 3-D medical imaging (on the order of ten thousand beams), as real-time imaging is no longer feasible. For measurements to be made at 15-cm depth, 200 µs per beam are necessary, and about 3.2 s have to be spent to perform a full 3-D scan [92]. A possible tradeoff is to emit wider beams to insonify larger areas and, for each emission, to compute all the received beam signals related to the insonified area. Real-time imaging is possible if the reception beams are computed in parallel, but there are losses in both lateral resolution and sidelobe rejection.

Two innovative and interesting approaches have recently been proposed in [93] and [94]. According to the first approach [93], [95], a set of independent waveforms (pseudorandom codes) are simultaneously emitted over a large area, where they interfere. The receiving array employs a conventional beamformer followed by a bank of transversal filters, each associated with the echo expected from a specific direction inside the insonified area. In this way, many image lines can be computed in parallel for each emission, the range resolution is improved, the lateral resolution is better than that of systems using double beamforming, and sparse arrays can be adopted. In the second paper [94], the theory of limited-diffraction beams [39], [96] is exploited: just one plane wave is transmitted by a 2-D array to insonify a scene, and the echoes are received by the same array but weighted to produce limited-diffraction responses. This method achieves the same lateral and range resolutions and the same sidelobe rejection as those of conventional double beamforming but, as only one transmission is sufficient and all computations can be performed by the FFT, real-time operations can be easily accomplished.
However, only the scene insonified by the plane wave can be imaged; thus, the lateral dimension of the image is bound to that of the array. Another crucial problem is the realization of a 2-D transducer array, due to the high costs of the transducers and of the related I/O channels (they should be wide-band ones, to allow a system to have the fine range resolution required in the medical field). Moreover, a 2-D array should be handheld, but this is not allowed by present connection cables, thus requiring the integration of a multiplexer together with the array [84], [92]. A few strategies to reduce the number of array elements have already been described in Section IV, like the generation of a very sparse array (used both to transmit and to receive) [38], [56], [58], [97], or the combination of two smaller sparse 2-D arrays (one to transmit, the other to receive), where the element positions are chosen at random [98] or are periodic but different for the transmitting and receiving elements [68].

In underwater imaging, one expects 3-D images to be visualized in such a way as to clearly show the object surfaces that are first met along the steering lines, regardless of the internal structures of the objects and of the objects occluded by those actually visualized. In medical imaging, this is possible only in a few specific cases (e.g., [99]), as often there is not a clear object surface. A 3-D image can be displayed as a pack of 2-D orthoscopic images, each of them representing the acoustic intensity backscattered by a plane parallel to the array and at a given distance from it. This procedure is called C-mode (or C-scan), to differentiate it from the traditional cross-sectional imaging, which is called B-mode [27]. More generally, the visualization of a full 3-D volume is preferable, although often difficult to understand [100], [101]. To this end, a lot of work is currently in progress; a recent paper by Hernandez et al. [101] presents a wide panorama of the attempted techniques and proposes an interesting stereoscopic visualization.

Due to the above-mentioned difficulties, not many prototypes of real-time 3-D imaging devices based on ultrasound have so far been produced. Of particular importance is the machine developed at Duke University [62], [100]: it includes a 2-D array (one fraction of the elements is used to transmit and the other to receive, with different geometries), a large set of steering directions, eight images/s, and an online display of volumetric images. Such a machine has been updated several times [98], [102], and some research teams are working to build larger effective 2-D arrays [84], [102]. 3-D acoustic images are currently produced by packing many 2-D cross-sectional images acquired by using the traditional B-mode scanning. There are different ways [101], [99], [92], [103] of mechanically moving (rotating, translating, or rocking) a linear array, acquiring a set of appropriate slices, and fusing them together. However, real-time imaging is not feasible, as each 2-D image requires about 20 ms. Recently, Lockwood et al. [103] proposed an interesting way of mechanically moving a linear array, emitting just one signal from each position, acquiring all the echoes and, at the end, producing a 3-D image in real time by exploiting the synthetic-aperture concept.

VII. IMAGE REPRESENTATION AND LOW-LEVEL PROCESSING

The type of image representation derived from raw signals is an important aspect, as it can favor the subsequent information-extraction processes. Moreover, whatever image representation is adopted, a preliminary data-processing stage is always present in every acoustic imaging system, in order to perform basic operations useful to improve image quality and human understanding. These methods are normally simple (e.g., thresholding) and should be characterized by a relatively fast time response. In some cases, nonstandard image-formation processes are proposed, able to directly generate images of better quality. Such methods can be considered as early approaches to image restoration, possibly applied directly to signals, rather than to generated images, in order to reduce the degradations due to speckle noise and sidelobes. In this section, image-formation techniques and early processing methods, like thresholding and filtering, are described and evaluated. However, due to the variety of sensing devices and applications, no standard reference data are available; thus, an objective comparison of the methods is not possible, and only individual evaluations of the single techniques can be carried out.

A. From Resolution Cells to Images

In the previous sections, several types of systems for the formation of acoustic images have been described. A holography-based system directly extracts a set of images (organized into concentric spherical layers) representing the acoustic responses of a scene at specified distances. The lens-based and beamforming systems provide a set of acoustic signals, each representing a backscattered echo coming from a specific direction. Regardless of the imaging system used, the obtained 3-D information is organized into a dense lattice of resolution cells of different dimensions (see Fig. 7), which cover the whole volume of interest. The essential information about each cell lies in the coordinates of the cell center and in the acoustic amplitude or intensity (proportional to the reflectivity of the scene inside the cell itself). This representation of the information is not always efficient for subsequent postprocessing or visualization; moreover, the coordinates of the cell centers are often expressed in polar coordinates.

A first process, sometimes performed to achieve an initial cleaning of an imaged volume, discards all the resolution cells whose acoustic amplitudes do not exceed a given threshold. The aim is to remove the effects of electronic noise and, in particular, of the sidelobes in the beam pattern of the acoustic system. Displaying on a screen (by a 3-D rendering method) the resolution cells retained after the thresholding operation may result in a coarse visualization of the external surfaces of the objects contained in a scene. Independently of the threshold value, the resolution cells can be projected on a regular 3-D grid of voxels (volume elements) of constant dimensions in a Cartesian coordinate system. This operation is called scan conversion. To avoid a loss of resolution, each voxel should be smaller than the smallest resolution cell, so that one or more voxels are contained in a given resolution cell.
To this end, there are two possible approaches: 1) searching for the voxel containing the center of a given resolution cell, assigning the acoustic amplitude of the cell to such a voxel, repeating this procedure for all the cells and, finally, interpolating in order to assign a value also to each voxel that does not contain a cell center; 2) computing how many voxels are inside a given cell on the basis of the cell dimensions, assigning the acoustic amplitude of the cell to those voxels, repeating this operation for all the cells and, finally, checking whether some voxels are unassigned. The resulting regular grid of voxels may be a useful starting point for many postprocessing purposes, and also facilitates the following operations: 1) extracting a sequence of 2-D images representing the acoustic responses of successive planar slices of a scene; 2) extracting several 2-D images cut with different orientations; and 3) projecting the 3-D volume by performing an orthogonal or perspective projection that integrates the acoustic amplitudes of the cells met along each projection line.

An alternative and more compact way of exploiting the collection of resolution cells is to use a couple of 2-D images, perfectly registered (pixel by pixel): one for the amplitude and the other for the range information (see the example in Fig. 12). The implicit consideration is that we are interested in the first object surface that is met starting from the array center and following a given steering direction, thus disregarding what is behind such a surface. If this is the case, we expect to extract only two values from each beam: the distance of the surface (if any) and the amplitude of the acoustic response. A common method to measure the distance of a scattering object is to search for the maximum peak of the beam-signal envelope [87], [43], [104]. Denoting by one quantity the envelope of the beam signal steered in a given direction (potentially after a matched filtering) and by another the time instant at which its maximum peak occurs, one can derive the related distance from simple geometrical considerations, together with the related acoustic amplitude; a sketch of this extraction is given below. The amplitude value is regarded as an estimator of the reliability of the range measure and can be actively exploited [105]. Besides, the whole amplitude image can be used to estimate the 2-D shapes of the objects present in a scene. Another possible selection method is to search for the first peak exceeding a given level, instead of searching for the absolute maximum peak. More accurate methods of detection and reliability estimation could be devised, but they would be hardly applicable in 3-D underwater imaging [43], [106]. Therefore, for each steering direction, a triplet (steering direction, measured range, and acoustic amplitude) can be extracted, so the 3-D data reduce to a set of triplets, the number of which is equal to the number of beam signals. The set of triplets can be visualized on a monitor, as shown in Fig. 11, where only the triplets whose amplitude values exceed a given threshold are retained, or projected on two orthoscopic images parallel to the plane of the array. Each of the two images is characterized by a regular grid of pixels whose dimensions are smaller than those of the smallest resolution cell, thus allowing the projection of each triplet to incorporate one or more pixels, as depicted in Fig. 15.
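The peak-based extraction just described can be condensed into a few lines; the beam-signal layout, the sampling rate, and the threshold below are assumptions of this sketch, not values from the paper.

```python
import numpy as np

C = 1500.0   # sound speed in water [m/s]

def extract_triplets(envelopes, fs, min_amp):
    """From beam-signal envelopes to (beam index, range, amplitude).

    envelopes : (n_beams, n_samples) array of beam-envelope magnitudes
    fs        : sampling frequency of the envelopes [Hz]
    min_amp   : amplitude threshold below which a beam yields no triplet
    """
    triplets = []
    for b, env in enumerate(envelopes):
        k = int(np.argmax(env))          # sample of the maximum peak
        amp = float(env[k])
        if amp < min_amp:
            continue                     # no reliable surface along this beam
        r = C * (k / fs) / 2.0           # two-way travel time -> distance
        triplets.append((b, r, amp))
    return triplets

# Toy usage: 3 beams, 1-kHz envelope sampling, echoes at different delays.
env = np.zeros((3, 400))
env[0, 100], env[1, 220], env[2, 50] = 1.0, 0.6, 0.05
print(extract_triplets(env, fs=1e3, min_amp=0.1))   # beam 2 is discarded
```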

Fig. 15. Sketch of a 3-D imaging system in which two images (range and amplitude) are obtained starting from the peaks of the envelopes of the beam signals.

The first image is a range image, in which the polar point defined by the steering direction and the measured distance is converted into a Cartesian point; the second image is an amplitude (or intensity) image, in which the amplitude measure is associated with the same pixel position. An example of such a projection is given in Fig. 12: a linear relation links the largest range value to white pixels and the smallest range value to black pixels. In the amplitude image, the darker the pixels, the higher the confidence. In the foregoing, we have assumed the use of beamforming, but the same procedure can also be applied to the signals acquired by the sensor retina of a lens-based system [43], or to the signals derived by reading the sequence of values of the cells (met in a given direction) on the concentric spherical layers of a holographic system [6], [82].

At the end of Section V, we mentioned some sonar systems commonly used in practical operations, although they are not 3-D devices working in real time. Avoiding details, it is important to point out the following. Pencil-beam sonar systems can produce both a range and an amplitude image of the scanned volume by an appropriate projection of the gathered information in Cartesian coordinates. Forward-looking sonar systems produce an amplitude image in polar or Cartesian (by projection) coordinates, like that depicted in Fig. 7; the image is a top view of a scene and shows the acoustic responses of the objects in it. Sidescan sonar systems typically produce only amplitude (or intensity) images of the sea bottom, by projecting the collected signals on a seafloor assumed to be flat at a given depth. In fact, the acquisition geometry is based on the oblique incidence of the acoustic beam on the sea bottom: an object placed on the bottom produces a stronger echo and, behind it, a region of acoustic shadow of an extent proportional to the object dimensions. Sidescan systems can be enhanced by adding a second linear array, parallel to the first one, which also allows one to obtain a bathymetric map (range image) thanks to an interferometry-based processing [89], [107]; in this case, two registered images are generated, an amplitude image and a range image. Multibeam echosounders typically produce bathymetric maps, which are range images on a large scale. However, by adequately processing the computed beams, it is possible to arrange also an amplitude (or intensity) image of the investigated sea-bottom area, similar to the typical images provided by sidescan systems; in this case too, two registered images are generated, an amplitude image and a range image. Greater details, a discussion, and a comparison related to this topic can be found in [89]. Finally, in both sidescan sonar systems and multibeam echosounder systems, images are formed line by line while the platform is moving; therefore, the vertical and horizontal dimensions are different, and the sensor position and attitude should be taken into account for the image generation, as well as for the subsequent data processing.

B. Thresholding and Filtering Methods

At present, different applications (e.g., object detection and recognition, seafloor bathymetry and classification) and different sensing configurations (multibeam, sidescan) have to face different problems that preliminary processing procedures can alleviate or fully solve.
In general, there is the need to disregard clutter scatterers; to remove sidelobe interferences and noise; and to compensate for the angle of incidence of the acoustic waves, for the sound velocity, and for the position and attitude of the sensor carrier [87]. Some of these problems (e.g., sound-velocity profile estimation, ship attitude and motion compensation) are not considered in high-frequency short-range imaging, but only in longer range sonar systems. In these cases, image-processing techniques to improve the amplitude/intensity image quality are nowadays commonly utilized, mainly aimed at speckle reduction and contrast enhancement, and normally preceded by geometrical corrections. Speckle is typically tackled by applying smoothing finite impulse response (FIR) filters of appropriate size (e.g., from 3 × 3 to 7 × 7 pixels), so that each pixel is restored by a weighted linear combination of its neighbors; a sketch is given below. Unfortunately, the application of noise-removal and smoothing techniques usually blurs an image, hence reducing information useful for interpretation. More information about these techniques can be found in [108]–[110].
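As a concrete instance of such neighborhood operations, the following sketch applies a 3 × 3 box (mean) filter and a 3 × 3 median filter to a noisy amplitude image; the filter sizes and the synthetic image are assumptions of the example.

```python
import numpy as np
from scipy.ndimage import uniform_filter, median_filter

rng = np.random.default_rng(4)

# Synthetic amplitude image: a bright square target over a speckled background.
img = 0.2 * rng.rayleigh(size=(128, 128))   # Rayleigh-like background
img[40:80, 40:80] += 1.0                    # target

smoothed = uniform_filter(img, size=3)      # 3x3 moving-average FIR filter
despeckled = median_filter(img, size=3)     # 3x3 median (edge-preserving)

# The mean filter lowers the background variance but blurs the target
# edges; the median filter suppresses outliers while keeping edges sharper.
print(img.std(), smoothed.std(), despeckled.std())
```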

In general, as mentioned in the previous sections, the first and simplest way to discriminate between actual backscattered echoes and clutter is to properly determine a threshold level. In this way, it is heuristically assumed that true echoes have a stronger (or different) response than other cluttering interferences, even though this is not always true. This problem is the first one encountered in acoustic image processing and understanding, and it also involves the detection of the entities of interest, namely, the targets. At a very early stage, detection is typically performed by looking for the most significant amplitude peaks of the beam signals from which the images are derived. More formally, in probabilistic terms, peak detection is generally performed by using the classic detection theory, which models the involved signals (noise and signal plus noise) by probability density functions (PDFs). In summary, from the estimated or assumed PDFs, a threshold $T$ can be chosen in order to maximize the probability of target detection or to minimize the probability of a false alarm. Denoting by $p_n(x)$ the PDF of the noise alone, by $p_{s+n}(x)$ the PDF of the signal plus noise, and by $x$ the signal amplitude (after beamforming, for instance), we have

$$P_D = \int_T^{\infty} p_{s+n}(x)\,dx, \qquad P_{FA} = \int_T^{\infty} p_n(x)\,dx. \qquad (46)$$

After fixing the threshold, a tradeoff among the signal-to-noise ratio (SNR), $P_D$, and $P_{FA}$ is critical for any reliable operation using sonar systems (other methods can be used to model the problem [111]).
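To make (46) concrete, the sketch below computes the threshold yielding a prescribed false-alarm probability when the noise envelope is Rayleigh distributed (a common assumption for the magnitude of complex Gaussian noise); the scale and the tested P_FA values are illustrative.

```python
import numpy as np

def rayleigh_threshold(sigma, p_fa):
    """Threshold T with P_FA = P(noise envelope > T) under Rayleigh noise:
    P_FA = exp(-T^2 / (2 sigma^2))  =>  T = sigma * sqrt(-2 ln P_FA)."""
    return sigma * np.sqrt(-2.0 * np.log(p_fa))

rng = np.random.default_rng(5)
env = rng.rayleigh(scale=1.0, size=10**6)       # synthetic noise envelope

for p_fa in (1e-1, 1e-2, 1e-3):
    T = rayleigh_threshold(1.0, p_fa)
    # Monte Carlo check of the resulting false-alarm rate
    print(f"P_FA={p_fa:.0e}  T={T:.2f}  empirical={np.mean(env > T):.1e}")
```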
In practice, an adequate thresholding constitutes per se a good detection method, but more sophisticated approaches should be employed to identify the entities of interest, depending on the application and on the type of sensor used. Typical examples can be the detection of a significant depth discontinuity in a local area, or of different backscattering responses (i.e., the backscattering strength, also used to classify a target [10]). This initial thresholding process has been widely applied in underwater 3-D imaging [14], [43], [87], [104]–[106], [112] and also in some medical-imaging cases [94]. The threshold value is typically fixed at the level of the highest sidelobe, equal to -13 dB when unitary weights are applied; therefore, all the cells with an amplitude smaller than 22% of the maximum one are discarded. Of course, this kind of processing is applied directly to the beam signals and is rather drastic, so that it may eliminate real echoes as well as clutter. To avoid that a few very high responses, due to a specular effect, may produce too high a threshold, the maximum amplitude can be computed as the average over a certain percentage of the cells having the highest amplitudes. This process is similar to the application of an adequate threshold filter to a generated amplitude image, in such a way that low-valued pixels are switched off and only the pixels above the threshold are visualized (together with the related 3-D measures, if a range image is also provided). As a result, a good thresholding and proper image representations may allow a direct analysis of the shapes (2-D and possibly 3-D) of the objects present in a scene. An interesting example is presented in [87], in which the problem is the measurement of the seabed topography by a multibeam sonar. Sea-bottom echoes and reverberation are separated by setting a threshold estimated by calculating the intersection of the probability density functions of two distributions, assumed Gaussian for the echoes and Rayleigh for the reverberation. This approach follows practically the same scheme described above and derives from the Bayes classification theory, aimed at minimizing the classification error when a two-class problem is faced.

Another interesting approach is proposed in [90], in which a high-resolution sonar is utilized for autonomous underwater vehicle (AUV) navigation. In this work, a set of operations is performed to reduce the amount of data and to remove noise and sidelobe interference efficiently, for the purpose of fast object detection and tracking. Two thresholds, a lower and a higher one, are set for object detection; then, along each beam signal, echo amplitudes larger than the higher threshold (strong echoes) are regarded as belonging to an object. Echo amplitudes larger than the lower threshold are considered as belonging to an object only if they are close (adjacent) to a strong echo (already classified); echoes below the lower threshold are discarded. The method is quite simple and efficient, and has been derived from a hysteresis criterion also used in other contexts (e.g., edge detection [113]); a sketch is given below.
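A minimal sketch of this two-threshold (hysteresis) detection on a single beam signal follows; the thresholds and data are illustrative, adjacency is taken along the time axis only, and np.roll wraps around (acceptable for this toy example).

```python
import numpy as np

def hysteresis_detect(env, t_low, t_high):
    """Label samples of a beam envelope as object echoes:
    samples above t_high are strong echoes; samples above t_low are
    kept only if adjacent to an already accepted sample."""
    strong = env > t_high
    weak = env > t_low
    keep = strong.copy()
    changed = True
    while changed:                      # grow strong regions into weak ones
        grown = keep | (weak & (np.roll(keep, 1) | np.roll(keep, -1)))
        changed = not np.array_equal(grown, keep)
        keep = grown
    return keep

env = np.array([0.1, 0.4, 0.9, 0.5, 0.2, 0.45, 0.1])
print(hysteresis_detect(env, t_low=0.3, t_high=0.8).astype(int))
# -> [0 1 1 1 0 0 0]: the 0.4 and 0.5 samples survive because they are
#    adjacent to the strong 0.9 echo; the isolated 0.45 does not.
```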

Statistical approaches can also be applied directly to the measured data (disregarding the specific acoustic system) to estimate the actual information. This methodology consists in statistically modeling the physical acquisition process and, consequently, in using adequate inversion techniques to remove the noise (restoration process [114]). As an example, sidescan sonar image restoration is addressed in [115], in which the actual reflection characteristics of the seabed are statistically estimated by modeling the related image as a Markov random field (MRF). MRF [116] approaches are characterized by the capability to model, by probability distributions, the specific acquisition process (observation or sensor model) and the a priori information available (a priori model). The two models are then transformed and merged into a single cost functional, called energy, the minimization of which leads to the optimal estimate in some probabilistic sense [typically, the maximum a posteriori (MAP) probability]. In [115], a cost functional is defined over the image, taking also into account the possible discontinuities contained in it (to avoid oversmoothed results). As a consequence, a classic energy formulation is proposed that considers a Gaussian sensor model and a smoothing, edge-preserving term. The actual restoration is obtained by minimizing such an energy by a stochastic (simulated annealing) or a deterministic [iterated conditional modes (ICM)] procedure, thus leading to a MAP solution. As expected, simulated annealing allows one to get better results but at higher computational costs, whereas ICM provides good results only if the initial estimate is close to the solution.

From this brief analysis, it can be noticed that several problems affect acoustic image formation and quality improvement, and that a set of effective techniques can be applied, ranging from simple ones (thresholding) to more complex methods. Actually, in terms of noise reduction and quality improvement, FIR filters and thresholding have proved to offer a good compromise between computational complexity and performance, whereas, if accurate image restoration is the main objective, methods involving statistical modeling yield better results, but at a higher cost.
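As an illustration of the latter class of methods, the following is a minimal ICM-style sketch of MAP restoration, assuming a Gaussian observation model and a simple absolute-difference prior that stands in for the edge-preserving term of [115]; all parameters are illustrative.

```python
import numpy as np

def icm_restore(observed, labels=np.linspace(0.0, 1.0, 16),
                sigma=0.1, beta=2.0, n_iter=5):
    """Deterministic ICM minimization of a classic MAP energy:
    E(f) = sum (f - g)^2 / (2 sigma^2) + beta * sum |f_p - f_q|,
    where g is the observed (float, normalized) image and the second
    sum runs over 4-neighbor pairs. A sketch only."""
    f = observed.copy()
    H, W = f.shape
    for _ in range(n_iter):
        for i in range(H):
            for j in range(W):
                nbrs = [f[x, y] for x, y in ((i - 1, j), (i + 1, j),
                                             (i, j - 1), (i, j + 1))
                        if 0 <= x < H and 0 <= y < W]
                # Assign the label minimizing the local energy.
                costs = [(l - observed[i, j]) ** 2 / (2 * sigma ** 2)
                         + beta * sum(abs(l - n) for n in nbrs)
                         for l in labels]
                f[i, j] = labels[int(np.argmin(costs))]
    return f
```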
The above methods try to improve the quality of the generated images, but they do not directly affect the image formation process for this purpose. Other methods try to influence or adjust this process in order to directly obtain good-quality short-range acoustic images [106], [117]. These techniques can be considered near-sensor approaches, as they are able to obtain restored data of better quality directly from the image formation procedure, taking also into account the particular geometry of a sonar system. In [106], the concept of confidence is introduced for the generation of acoustic images while improving their quality. Starting from the beamforming process (for short-range imaging applications), confidence information allows a rapid examination of the beam signals, aimed at accurately detecting the possible target echoes backscattered from a scene while disregarding useless information. In other words, confidence is measured directly from the beamforming process and can be considered an ideal interface between the beam signals and the related image, for it operates directly on the image-formation process, with only a slight increase in the computational load. More precisely, the reliability of the spatial information extracted from each beam signal is considered a sort of signal-to-noise ratio that takes into account how the specific information is extracted from the beam signals. Therefore, the idea consists in estimating the correct distance of an obstacle (or of the various layers of a scene) by estimating the reliability with which, at every sampling instant, the examined beam signal can be compared with a possible (model-based) echo of the transmitted pulse. To this end, the mean square distance between the beam-signal envelope at the beamformer output and an expected beam signal based on the modeling of the scene is estimated. Under suitable hypotheses and approximations [106], the shape and the amplitude of the expected envelope of the backscattered signals vary only slightly for not too different object distances; hence, a constant shape and a varying amplitude are assumed. Therefore, a collection of expected envelopes can be preliminarily constructed, according to the distance of a possible target, to the material the target is made of, and, if necessary, to the characteristics of the propagation medium. Two confidence levels are proposed. The first, named CL1, is used to make a direct comparison between the envelopes of the expected beam signal and the real one. As a result, a local signal-to-noise ratio is approximated, in which the signal is represented by the real envelope and the noise is represented by the difference between the expected and real envelopes. The second level, named CL2, has been introduced to overcome the problem of an inaccurate knowledge of the expected amplitudes, so the previous measure is modified to account for separate information on the envelope shape and on the envelope amplitude. In Fig. 16, two 3-D plots related to a synthetic scene containing three objects of different shapes are shown. They represent the elevation maps derived by the normal [Fig. 16(a)] and the enforced [Fig. 16(b)] image generation processes; the latter process used the CL2 confidence level. It can be noticed that the quality of the second map is quite good and better than that of the first map, and that the range accuracy is also improved: the mean square error (MSE) computed along the range direction is reduced by about 33%.

An interesting application concerns the accurate localization of targets in images acquired by a 3-D multibeam sonar (FishTV [117]). The problem is regarded as a parameter estimation process, where the parameters are the position and the target strength [10] of the target. A least-squares criterion is applied that minimizes the distance (residual) between the observed data and the data predicted by an accurate physical model taking into account the geometry of the sonar system. The minimum of the residual provides the optimal parameter values in the maximum-likelihood sense. This method is computationally efficient and has been tested on simulated noise-free scenes. The error surfaces, displayed over the entire parameter search space, prove a high degree of accuracy for both target position and strength, and a potential robustness to noise. Although this method is devoted not so much to image formation as to target localization, it is cited here as a technique that improves and facilitates the subsequent processing stages (in this case, target tracking) by incorporating the sonar system characteristics directly into the solution scheme.
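The following sketch conveys this least-squares localization idea through an exhaustive search of the parameter space; the physical model used here (an angular beam gain with spherical spreading) is a hypothetical stand-in for the accurate sonar model of [117], and all names and parameters are illustrative.

```python
import numpy as np

def predicted_echo(pos, strength, beam_dirs):
    # Hypothetical forward model: the echo level on each beam decreases
    # with the angle between the beam axis and the target direction,
    # and with spherical spreading; NOT the actual FishTV model.
    target_dir = pos / np.linalg.norm(pos)
    gain = np.clip(beam_dirs @ target_dir, 0.0, None) ** 2
    return strength * gain / np.dot(pos, pos)

def localize(observed, beam_dirs, candidate_positions, candidate_strengths):
    """Grid search for the (position, strength) pair minimizing the
    residual between observed and predicted beam amplitudes."""
    best, best_res = None, np.inf
    for pos in candidate_positions:
        for s in candidate_strengths:
            res = np.sum((observed - predicted_echo(pos, s, beam_dirs)) ** 2)
            if res < best_res:
                best, best_res = (pos, s), res
    return best, best_res
```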

Fig. 16. (a) 3-D elevation map of the scene containing three objects: the map was obtained by means of the absolute maximum amplitudes (the points generated by maximum amplitudes below a threshold, i.e., 10% of the total range, were translated into the background). (b) 3-D elevation map of the same scene: the map was obtained by the CL2 confidence measure (the points generated by confidence values below a threshold, i.e., 10% of the total range, were translated into the background).

VIII. IMAGE SEGMENTATION, RECONSTRUCTION, AND INTERPRETATION

After image formation and filtering, more structured postprocessing methods can be applied, especially segmentation and reconstruction techniques for high-level tasks like classification and (object) recognition. Typical applications using these techniques include scene reconstruction and interpretation, object detection and recognition, seafloor classification, and texture analysis. A large body of literature is oriented to robotic applications, in which methods for (autonomous or tethered) vehicle navigation, path planning, and obstacle avoidance are the most widely utilized. A typical data processing scheme, from image formation to information visualization, is sketched in Fig. 1, and the meanings of the several phases and related techniques are summarized in Table 1. This hierarchical framework is to be interpreted, at a coarse resolution, as subdivided into low-, middle-, and high-level processing, as in classic vision approaches [7]. After image formation and low-level filtering, (middle-level) segmentation or reconstruction is applied, depending on the type of data, to identify the interesting image areas. After segmentation, scene understanding is obtained by a recognition or classification stage (high level); alternatively, the information can be visualized for an immediate and easier human understanding. As a consequence, this section is structured into these two data processing categories. A variety of techniques have been proposed in the literature to tackle the aforesaid applications and, although no standard algorithms are identifiable, statistical approaches can be recognized as the most commonly used, especially for segmentation problems. In this section, we analyze all such techniques, which operate on both 3-D and 2-D acoustic data, giving special emphasis to the methods devoted to segmentation, reconstruction, and object recognition, and to the algorithms whose theoretical structure is sufficiently general to be also applicable to scene interpretation problems. Also in this case, no standard data sets exist that allow algorithms to be quantitatively compared, so only subjective evaluations of the performances of the single techniques and qualitative comparisons can be carried out.

A. Segmentation and Reconstruction

Several segmentation methods have been developed, according to the different kinds of acoustic data (2-D or 3-D) and to the availability of additional information or simplifying hypotheses (e.g., a flat sea bottom). In general, segmentation is considered as a process that groups sets of pixels (regions) having similar characteristics, associating a label with each different region. The set of labels may have a semantic meaning (e.g., the seafloor type), in which case we can also perform classification, or simply discriminate among different regions. Reconstruction techniques can be used whenever 3-D information is available, in order to exploit also this kind of data to obtain a more concise and significant representation (e.g., the polynomial function of a surface) of a scene. For these issues, an important and effective branch of the research is devoted to statistical approaches, typically MRF models, which are able to take into account the physical characteristics of the acoustic process in the model. Another important aspect of this research concerns geometric methodologies, which utilize 3-D data processing to get a significant representation of the entities present in a scene.

1) Segmentation: In sonar image processing, MRF features can be exploited in different ways. In some cases, both the image formation procedure and the segmentation criteria are incorporated in the model; in other cases, MRFs are used as a means to homogenize the regions extracted from a preliminary segmentation. In [118] and [119], MRFs are used for the segmentation of multibeam echosounder images of large seafloor areas. A reliable geological interpretation of this kind of sonar data requires postprocessing and signal correction [120] to remove artifacts, consisting in inaccurate range estimates in the different directions and in a wrong backscattering strength [10]. The backscattering strength (BS) is a characteristic of the energy reflected from the insonified areas, and depends on the target and on the angle of incidence of the impinging acoustic wave. For these reasons, the BS is estimated to discriminate among different seafloor types, but a reliable classification is not easily obtainable. In [118], the MRF approach was used to model: 1) the sensor antenna's directivity, and 2) data tested experimentally on a flat seabed. First, the backscattering strength was estimated by fitting a statistical distribution model. Second, MRFs are exploited to segment the image resulting from the mosaicking of the several stripes of the seafloor, previously organized by a cartography expert. To this end, an energy functional is defined that is composed of a term taking into account the spatial interaction between labels and of another term taking into account the distribution of the observations. Due to the particular geometry of the acquisition process, a specific neighboring set was devised to account for the different actual dimensions of the image in the horizontal and vertical directions. The interaction between labels is based on such a neighboring set, and the energy term is set by assigning different weights to equal label configurations of neighboring pixels, whereas a null weight is assigned in the case of different labels (a minimal sketch of this label-interaction term is given below).

Fig. 17. (a) Example of a mosaic of the seafloor with several types of bottom. (b) Results of the segmentation phase.
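The sketch below illustrates the label-interaction energy over a 4-neighborhood; a single weight per direction is used for brevity, whereas [118] assigns different weights to the different label configurations (direction-dependent weights would reflect the anisotropic neighborhood discussed above).

```python
import numpy as np

def label_interaction_energy(labels, w_h=1.0, w_v=1.0):
    """Prior energy over a 4-neighborhood: equal neighboring labels
    contribute a negative weight (favoring homogeneous regions),
    different labels contribute zero. w_h and w_v are illustrative
    horizontal/vertical clique weights."""
    same_h = labels[:, :-1] == labels[:, 1:]   # horizontal cliques
    same_v = labels[:-1, :] == labels[1:, :]   # vertical cliques
    return -(w_h * same_h.sum() + w_v * same_v.sum())

labels = np.random.randint(0, 3, (32, 32))     # hypothetical label map
print(label_interaction_energy(labels))
```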
Observation modeling is based on a weighted statistical distribution and also on the criteria used to build the mosaic, through the introduction of a deformable neighborhood system. Such a system is able to exploit the different incidence angles along a stripe; therefore, a pixel labeling based on the different BS profiles is feasible. This algorithm requires a manual segmentation provided by a geologist, bathymetric data, and other information to derive the initial backscattering strength. A typical result derived from real data is shown in Fig. 17, which displays: 1) the original mosaicked data and 2) the segmentation results. The resulting segmentation complies with the geological delimitations and is of great support to geologists for seafloor exploration and interpretation.

In [121], another segmentation algorithm for seafloor characterization from multibeam echosounder images is presented. Also in this case, an image is partitioned by considering the angular variation of the backscattering strength as a function of the type of seafloor. Actually, the real incidence angle of the BS is estimated by taking into account the local seafloor slope; i.e., the backscattered signals, the bathymetric data, and the sensor characteristics are all considered for an effective image segmentation. Segmentation is then obtained by comparing the estimated BS values with a small set of BS variation laws (manually determined by experts using raw images) and by submitting this early result to a regularization algorithm (MRF-based) to make the segmentation locally homogeneous on the basis of the estimates of the neighbors. This method is applied to intensity sonar images and makes use of 3-D and other information to estimate the BS and then to segment an image. Although the method is formally correct, its main drawback lies in the need for information that is not always easily available; moreover, no ground-truth data are provided to evaluate its accuracy.

Actually, each step needs data potentially affecting the final result: a bathymetric map, precise system characteristics, and a geologist's subjective segmentation to derive the BS variation laws are all required for the good functioning of the algorithm.

In a series of works [105], [122]-[125], the reconstruction and segmentation of 3-D acoustic images are addressed by using MRF models. In these works, the usefulness of exploiting confidence information (considered as the backscattering intensity associated with each 3-D measure) during the estimation process is stressed, proving the effectiveness of the approach. In summary, the associated range and confidence (i.e., amplitude or intensity) images are considered as noisy samples of the true data and, regarded as coupled MRF models, they are jointly restored to reduce the degrading effects due to speckle and sidelobes by using several theoretically justified energy functions. The main feature of the approach is the definition of a single energy function that takes into account the sensor and a priori models concerning the two random fields; in the two models, the terms related to the range-image reconstruction are modulated by proper functions dependent on the confidence data. In a first formulation [123], the energy function was defined by simply assuming the 3-D measures associated with high confidence to be reliable and by discarding the low-confidence range points. Then, this criterion evolved into the definition of a complex functional that takes into account the physical significance of the coupling terms between the 3-D and confidence images either dynamically, utilizing reliability data [105], or statically, once the images have been preliminarily restored [122]. In all the proposed models, confidence is first restored and then used for the reconstruction: in general, the estimated confidence values operate both to reinforce the closeness-to-observation constraints and to prevent smoothing at the object boundaries. The minimization of the energy functional leads to the optimal estimation of the field pair in the MAP probability sense. In later works [124], [125], other energy formulations were proposed in order to consider region and line processes, so that a more accurate reconstruction can be attained, as well as the extraction of the object boundaries. The results are compared with those of MRF-based and classical methods, showing the better performances obtained by using confidence in an active way. Using a synthetic image as a test example, the MSE turned out to be about halved with respect to that derived by the other methods, and the segmentation error was also improved by about 44%. These are very good results, as compared with those provided by other methods for the same kind of images.
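The following sketch conveys the coupling idea with a deliberately simplified energy: the data term of the range field is modulated by the confidence field, while a plain quadratic term stands in for the edge-preserving priors of the cited works; all weights are illustrative.

```python
import numpy as np

def coupled_energy(rng_img, conf_img, rng_obs, conf_obs,
                   alpha=1.0, beta=0.5, gamma=0.5):
    """Illustrative energy for a coupled range/confidence field pair:
    the closeness of the range field to its observation is weighted by
    the (restored) confidence, so reliable 3-D points count more."""
    data_rng = np.sum(conf_img * (rng_img - rng_obs) ** 2)
    data_conf = np.sum((conf_img - conf_obs) ** 2)
    # Quadratic smoothness on both fields (stand-in for the
    # edge-preserving terms of [105], [122]-[125]).
    smooth = sum(np.sum(np.diff(f, axis=ax) ** 2)
                 for f in (rng_img, conf_img) for ax in (0, 1))
    return alpha * data_rng + beta * data_conf + gamma * smooth
```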
In Fig. 18, the results of the algorithm for a complex real object are displayed. A scene composed of a set of cinder blocks is shown in Fig. 18(a) and (b), representing the range and confidence images, respectively, and the related reconstructed and segmented data are visualized in Fig. 18(c) and (d), respectively. A synthetic scene is presented in Fig. 19, showing a rhomb (inclined with respect to the viewing direction of the sensor) with a hole inside [see Fig. 19(a) and (b) for the model, and Fig. 19(c) and (d) for the simulated data].

Fig. 18. Results related to the real scene of the cinder blocks: (a) input noisy range image; (b) input noisy confidence image; (c) final reconstructed range image; (d) restored confidence image.

One can notice the reconstruction/segmentation results, together with the outcome of the edge-extraction process, which has succeeded in detecting the object boundaries despite the high percentage of noise and blurring degradation in the original images [Fig. 19(e) and (f)].

Several works have been devoted to the segmentation of 2-D images (e.g., sidescan ones) aimed at identifying three main classes: echo, due to an object of interest (target); shadow, produced behind the target; and reverberation, caused by the bottom backscattering. In [126] and [127], the segmentation of 2-D sonar images is addressed by using an unsupervised hierarchical MRF algorithm. Images of the sea bottom are segmented into two classes, shadow and seabed reverberation, useful for the detection and classification of entities lying on the seabed. Gaussian and Rayleigh laws are used to model the shadow and reverberation classes (observation models), respectively, and clique potentials (i.e., weights assigned to particular configurations of labels in a neighborhood) are utilized as the prior model. The iterative conditional estimation (ICE) method is first used to estimate the model parameters, that is, the weights of the clique potentials and the parameters characterizing the distribution laws. Then, by using the estimated parameters, multiresolution MRF models are defined at each scale. The process starts at the highest (coarsest) level of the pyramid, where only local (intralevel) neighbor interactions are allowed. The result of a higher-level MRF is propagated to the lower level as an initialization, down to the finest resolution level. Results are obtained from real data only, so the accuracy of the segmentation cannot be estimated quantitatively, but the estimated models for the two classes are close to the actual distributions derived from the image histogram, and the visual quality of the resulting segmentation is good.

Fig. 19. Ideal images derived from the geometric scene model, acoustic images obtained by the beamforming, and related results of the segmentation algorithm: (a) ideal range image; (b) ideal confidence image; (c) noisy range image; (d) noisy confidence image. Results of the proposed algorithm: (e) reconstructed range image; (f) segmented confidence image; (g) related edge image.

A different statistical approach is followed in [128] for the segmentation of high-resolution sidescan sonar images. The usual echo, shadow, and reverberation discrimination is carried out by using a cluster analysis technique, i.e., the fuzzy K-means algorithm. It consists in estimating a feature vector (pattern) for each pixel (actually, the grey level and the variance) and in defining a cost function that takes into account the similarity of the patterns to K prototypes, weighted by parameters indicating the degree of membership of a pixel in a given prototype. The prototypes are identified by their centroids, and the procedure is iterated, varying the centroid locations and the membership parameters, until they remain unvaried. Extending the process to an image sequence, a further step concerning the adaptation of the centroids to adjacent images is performed by using simple rules compiled offline, which provide the offset to be applied to the centroid of each class from one image to another. Finally, a smoothing, edge-preserving filter is applied to refine the segmentation: the filter is aimed at removing isolated pixels and adjusting the boundaries between the classes.

Other techniques for image segmentation are feasible, e.g., spectral analysis [129], the estimation of the PDFs associated with the different backscattering strengths [129], and especially texture analysis [130]-[133]. In these cases, segmentation practically involves seafloor classification, as the extracted regions are associated with different textural behaviors, i.e., different seafloor characteristics. Since these methods are mainly devoted to seafloor characterization and cannot be easily utilized for object recognition and 3-D scene analysis (as MRF approaches can), they are not addressed in this paper.

2) Reconstruction: The other large research area concerns geometric methods for image reconstruction, which, in some cases, also imply segmentation. Several techniques have been proposed, depending on the different data structures: surface fitting, volumetric processing, the merging of different cross sections, and connected component analysis are among the methods devoted to the reconstruction of objects or scenes. The problem here is different from that faced in conventional (non-underwater) range image processing, as resolution, data sparsity, and noise appreciably affect underwater acoustic 3-D data.

In [134], a method to estimate the 3-D structure of an object by using several sonar (cross-section) images is presented. The method assumes that a forward-looking sonar pointed downward acquires a set of images around the object of interest. Reconstruction is considered as the estimation of the 3-D reflection function $f(x, y, z)$ from its projections on the planes

$x \cos\theta \cos\varphi + y \sin\theta \cos\varphi + z \sin\varphi = \rho$ (47)

where $\theta$ is the azimuth angle, $\varphi$ the elevation angle, and $\rho$ the distance from the origin. The three-dimensional information is given by the combination of a 2-D reflection map, estimated from the echo information, and a 2-D elevation map, estimated from the shadow data.
Therefore, a segmentation of the images into three types of regions (echo, shadow, and background) is first performed by using a couple of thresholds fixed on the basis of the a priori known sonar performances.

Second, the shadow information is utilized to compute the height of the object silhouette for a given azimuth angle

$z(y) = h_s \, l(y) / r_{\max}(y)$ (48)

where $h_s$ is the height of the sonar, $r_{\max}(y)$ the maximum distance of a shadow point, $l(y)$ the length of the shadow segment, and $y$ the across-range axis. As a result, the whole 3-D structure is obtained by a volumetric reconstruction that exploits the occluding contours and combines all the estimated cross sections. The reflectivity map is estimated through the computerized tomography of the echo information, by using the 2-D Radon transform. Simulation results prove the feasibility of the technique, and the presented real example confirms its practical applicability. However, as this approach assumes that the sonar images are acquired with known positional information around the object of interest, its performances can be physically limited by the typical sensor uncertainties and also by the possible inaccuracy of the preliminary segmentation algorithm.

In [135], the reconstruction of sparse 3-D sonar data is performed by a surface fitting procedure. The idea is based on [136] and consists in grouping points likely to belong to a smooth surface up to a certain degree. The algorithm is essentially a region-growing algorithm: a seed point is first selected, and a certain number of neighbors are considered for the initial surface fit

$z = f(x, y; \mathbf{a}) = \sum_{i+j \le n} a_{ij} x^i y^j$ (49)

where $\mathbf{a} = \{a_{ij}\}$ is the set of coefficients of the surface. The surface order $n$ ranges from one to four and, starting from the plane, is increased every time the error of the current fit exceeds a prespecified threshold. Second, points are added to the current group as long as the error remains low; otherwise, a new seed point is chosen, and the procedure is repeated until a certain percentage of the total number of points has been labeled. In essence, if we denote by $R_k$ the $k$th region containing a set of pixels $(x, y)$, and by $z(x, y)$ the value of the pixel, the error of fit for the $k$th region is calculated as

$E_k = [ \frac{1}{|R_k|} \sum_{(x,y) \in R_k} ( z(x, y) - f(x, y; \mathbf{a}) )^2 ]^{1/2} < T$ (50)

where $T$ is a proper threshold and $f$ is the approximating surface. This means that pixels can be added to $R_k$, represented by $f$, whenever the error of fit remains below the threshold. Other types of error are introduced, and the thresholds are empirically set and dynamically updated if the error of fit remains large.

In [137] and [138], a geometric method for the reconstruction and segmentation of acoustic images is proposed that represents a further refinement of the one described in [135] and [136]. Starting from a set of 3-D points, each associated with a backscattered intensity, the method tries to discriminate between regions characterized by a common behavior identified by quadric-surface coefficients. In summary, high-intensity points are first selected so as to start with reliable 3-D measures; then, a quadric surface is fitted, and other points are progressively added as long as the error remains small. The algorithm terminates when all the points have been assigned to a region. After the initial phase, refinement phases are subsequently performed to recover possible misclassifications of points. The final result is a segmented image where each different region is labeled and classified according to its geometric coefficients.
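A minimal region-growing sketch in the spirit of [135]-[138], restricted to first-order (planar) fits for brevity; the neighbor count and the error tolerance are illustrative, and the published algorithms additionally raise the surface order and perform refinement phases.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane z = a*x + b*y + c through (N, 3) points:
    the first-order case of the variable-order fitting."""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    rms = np.sqrt(np.mean((A @ coeffs - points[:, 2]) ** 2))
    return coeffs, rms

def grow_region(points, seed_idx, k=8, tol=0.05):
    """Greedy growth: start from the seed's k nearest neighbors and keep
    adding the nearest remaining point while the RMS error of fit stays
    below tol (cf. the error-of-fit criterion of (50))."""
    order = np.argsort(np.linalg.norm(points - points[seed_idx], axis=1))
    region = list(order[:k])
    for idx in order[k:]:
        _, rms = fit_plane(points[region + [idx]])
        if rms > tol:
            break
        region.append(idx)
    return region
```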
Fig. 20 gives an example of the functioning of the algorithm when applied to 3-D acoustic images representing a small part of an oil-rig structure.

Fig. 20. (a) Original set of 3-D data filtered by thresholding; (b) segmented 3-D data (the numbers correspond to the label regions).

As can be noticed, the algorithm has succeeded in identifying the tubular regions, but it is affected by under-segmentation problems. Actually, the poor range resolution tends to flatten the representation, so that points may be arbitrarily added to wrong regions. Moreover, some points may remain unlabeled, and the iterative procedure is computationally heavy. However, an accuracy analysis, performed directly on real images by estimating the lengths of the detected cylinders of the rig, shows average errors of the order of 3% with respect to the actual length.

A different approach is presented in [139] and [140], in which volumetric data processing is addressed. The considered application is the reconstruction of subbottom layers, starting from data acquired by a proper (parametric) low-frequency sonar (TOPAS). The raw 3-D data are processed by three procedures, i.e., interpolation, segmentation, and visualization. The first step is necessary as the raw data are quite sparse, and a 3-D interpolation based on a 2-D multiresolution pyramidal approach is applied using quadtrees, which are able to interpolate the map while accounting for the resolution of the data sparsity. In other words, a set of quadtrees is introduced to obtain a complete representation of the data at each resolution level, so that gaps may be filled at the proper resolution level according to their magnitudes (a minimal gap-filling sketch is given below).
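A minimal sketch of such multiresolution gap filling on a regular grid; for brevity, a square map with a power-of-two side is assumed, and a simple pyramid of block averages replaces the quadtree bookkeeping of [139], [140].

```python
import numpy as np

def fill_gaps_pyramid(grid):
    """Fill NaN gaps in a sparse 2-D map: build coarser levels by
    averaging the valid samples of 2x2 blocks (all-NaN blocks stay NaN
    and are resolved at an even coarser level), then fill holes top-down
    so that each gap is filled at the resolution matching its size."""
    levels = [grid.copy()]
    while levels[-1].shape[0] > 1:
        g = levels[-1]
        blocks = g.reshape(g.shape[0] // 2, 2, g.shape[1] // 2, 2)
        levels.append(np.nanmean(blocks, axis=(1, 3)))   # coarser level
    for coarse, fine in zip(levels[::-1][:-1], levels[::-1][1:]):
        up = np.kron(coarse, np.ones((2, 2)))            # nearest upsampling
        mask = np.isnan(fine)
        fine[mask] = up[mask]                            # fill only the gaps
    return levels[0]
```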

The segmentation phase makes use of octrees, the natural evolution of quadtrees into a 3-D representation. Also in this case, a multiresolution representation is adopted, where the data are smoothed as the resolution level decreases, and an unsupervised clustering technique is applied to partition the space into significant classes. The regions are subsequently refined by considering the region boundaries. Histogram modes are used as a criterion for the smoothing, and the boundary refinement is performed by estimating the 3-D orientation of the vector orthogonal to the boundary and by applying a 3-D filter for a precise estimation of the orientation. Finally, the results are visualized by using a virtual reality modeling language (VRML) model.

The work of Auran et al. [112], [141] addresses the reconstruction of sonar data by using the occupancy grid approach proposed in [142] for robotic applications. More specifically, forward-looking sonar data are processed to build a dynamic 3-D occupancy map, where some useful information is stored in an adequate data structure. After preliminarily thresholding the beam signals so as to retain all the strong responses, a suitable (sparse) volumetric representation is generated, in which the presence of a voxel implies the presence of a reliable (target) acoustic response in a given direction at some range distance (a minimal sketch of the map construction is given below). After the map formation, clusters are detected by a connected component analysis [141] in order to identify the main objects present in the scene, and geometric features (i.e., moments, radial and angular sizes, bounding box, area, volume, etc.) are extracted from the detected clusters and stored in the data structure. The subsequent steps aim at modeling and visualizing the detected objects by a surface fitting procedure; if different views are available, a complete reconstruction can be obtained (after separate surface estimations) by merging the surface patches. The approach proves computationally efficient and useful for subsequent processing stages (e.g., object recognition). In Fig. 21, a typical representation of the clusters generated from the echo returns is shown, and Fig. 22 reports the reconstructed object (a cylinder) obtained by surface fitting after fusing data acquired from several viewpoints.

Fig. 21. 3-D echo clustering (brighter cells correspond to stronger echoes) derived from the sonar returns from a curved wall object [112].
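A minimal sketch of the occupancy-map construction from thresholded beam detections; the data layout and the cell size are illustrative, and the cited works additionally attach cluster features to the data structure.

```python
import numpy as np

def occupancy_map(detections, cell=0.25):
    """Build a sparse 3-D occupancy map from beam detections that have
    already passed the threshold; each detection is a (unit direction,
    range, amplitude) triple. Voxels are indexed by integer cells."""
    grid = {}
    for direction, rng, amp in detections:
        point = rng * np.asarray(direction)            # 3-D echo position
        idx = tuple(np.floor(point / cell).astype(int))
        grid[idx] = max(grid.get(idx, 0.0), amp)       # keep strongest response
    return grid

# Hypothetical detections: unit direction vectors, ranges in meters.
dets = [((0.0, 0.6, 0.8), 5.0, 0.9), ((0.0, 0.62, 0.78), 5.1, 0.7)]
print(occupancy_map(dets))
```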

Fig. 22. Cylinder reconstructed by merging the echo clusterings (surface patches) taken from different viewpoints [141].

Two interesting works can be quoted as energy-minimization methods aimed at 3-D map registration and sea-bottom localization. In [143], the problem of a bathymetric survey using a multibeam echosounder is faced. Typically, the survey of a large area is performed by collecting several swaths of data while the ship moves straight in different directions over the same area; therefore, the swaths show a certain degree of overlap. Of course, due to the inaccurate positioning of the ship, such swaths are misregistered. This work aims at registering the collected data in order to obtain a precise chart of the investigated area. Contours at a constant depth are used as features for the registration and are coded as chain codes for the local alignment between two swaths. A cost function is then defined in terms of translation, rotation, and bending parameters, and is minimized to estimate the global positioning of the stripes. In Fig. 23, an example of the representation of four swaths before and after registration is given.

Fig. 23. Swath configuration (a) before and (b) after registration (global matching) [143].

In the work of Yang et al. [144], multibeam echosounder data are processed by an interferometric method combined with active contours for an accurate sea-bottom profile localization [89]. Typically, bottom profiling depends on the angles of incidence of the acoustic waves: bottom echoes are better detected by checking the amplitudes in near-specular directions, whereas, for oblique angles of incidence, phase information provides more accurate echo estimates. In any case, it is difficult to select a phase-based or an amplitude-based method for better accuracy. This work proposes the joint use of phase and amplitude information, embedded in an active contour, or snake, model [145]. A snake is an energy-minimizing spline guided by internal ($E_{int}$) and external ($E_{ext}$) forces that may move the curve toward an equilibrium state. Therefore, if one represents a contour in parametric form, $v(s) = (x(s), y(s))$, $s \in [0, 1]$, the total energy of the spline is given by

$E_{snake} = \int_0^1 [ E_{int}(v(s)) + E_{ext}(v(s)) ] \, ds.$ (51)

The internal force is represented by the curve derivatives up to a certain order, and the external force is provided by the data, simply considering the gradient estimate. The above equation can be transformed into the Euler equations, the solutions of which identify the contour.
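A minimal discrete sketch of such a snake evolution: the internal forces are obtained from finite-difference curve derivatives, while the external force is left as a user-supplied function (e.g., sampled from an amplitude- or phase-derived gradient field); all weights and step sizes are illustrative.

```python
import numpy as np

def snake_step(v, ext_force, alpha=0.1, beta=0.05, tau=0.2):
    """One explicit gradient-descent step for a closed discrete snake.
    v is an (N, 2) array of contour points; ext_force(v) returns the
    external force at those points. alpha weighs elasticity (2nd
    derivative), beta weighs rigidity (4th derivative)."""
    d2 = np.roll(v, -1, 0) - 2 * v + np.roll(v, 1, 0)
    d4 = (np.roll(v, -2, 0) - 4 * np.roll(v, -1, 0) + 6 * v
          - 4 * np.roll(v, 1, 0) + np.roll(v, 2, 0))
    return v + tau * (alpha * d2 - beta * d4 + ext_force(v))
```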

Finally, concerning pencil-beam sonar data processing, a few papers can be cited that aim at environment reconstruction [77], [88], [146], [147]. As data of this type are practically 2-D profiles (azimuth-range), the proposed methods rely on various combinations of such profiles, mainly through curve fitting according to the specific scenario to be recovered [77], [88], [147].

B. Classification, Recognition, and Visualization

Understanding acoustic images is quite a complex task, due to the noisy nature of the data and the difficulty of visual interpretation for inexpert human operators. Therefore, several techniques have been proposed for the detection of objects or entities of interest and their classification and recognition, up to their final visualization for a better comprehension. The difference between recognition and classification is not very sharp: here, we speak of recognition when we deal with the identification of objects represented by 3-D data, whereas classification can be defined as the task of assigning semantic meanings to certain subsets of data on the basis of the estimation of certain features (i.e., such subsets may not necessarily be identified by geometric features or 3-D data). In this section, the most significant techniques for 3-D images will be reviewed, together with all the methods aimed, in general, at a better acoustic image understanding and a final visualization.

The nature of the detected objects can be estimated in several ways, ranging from data processing close to the raw signals to higher level algorithms. Some methods are based on frequency analysis, in which the so-called frequency signature is different for different objects, and also for different aspects of the same object, so that it can be useful to discriminate between objects and their appearances. Other methods, more typical in the area of imaging, are based on an accurate (high-resolution) reconstruction, which allows the extraction of features to be used for classification by pattern recognition and computer vision methods. Further, there is also a set of techniques devoted to underwater vehicle applications (typically, AUVs), like tracking, environment modeling, and vehicle positioning, up to augmented visualization (augmented/virtual reality). Such techniques aim at improving vehicle navigation by providing means for obstacle avoidance, path planning, and scene reconstruction. Since these tasks are application-oriented, the cited works present procedural methodologies, mainly starting from data filtering and segmentation up to image understanding.

1) Object Recognition and Classification: In [111], seafloor object detection and classification are faced by using a continuous transmission frequency-modulated (CTFM) sonar. In this work, the sensor characteristics are compared with those of the more common pulsed sonars, and the echo classification problem is addressed. Two kinds of classification methods are used, i.e., frequency signature and fine range classification, and a set of experimental tests on spheres, drums, a simple seabed, and other targets are described and the results reported. It is shown that automatic classification can be performed efficiently, especially in discriminating between rough and smooth (i.e., man-made) objects. The classification performances are similar to those of human operators when the data are presented to them in multiple formats (visual and auditory displays).

A structured method to detect toxic-waste deposits in sidescan images is described in [148]. The proposed system is composed of a series of processing modules hierarchically subdivided into low-, middle-, and high-level phases.
After the generation of a sidescan image, a (median) filtering method and a segmentation method based on thresholding (using hysteresis) are applied to detect the shadow, echo, and reverberation (background) areas. Insignificant areas are eliminated, and the echo and shadow areas are identified by a connected component algorithm and labeled. Then, a set of discriminant features, like area, perimeter, moments (centroid, etc.), algebraic invariants, elongation, shape factor, differential SNR, and others, are computed for each region and combined in a score function indicating the degree of confidence in the association of the considered region with a possible waste deposit. The system performances depend on the ability of an expert operator to train the framework in terms of association rules, function parameters, and feature selection and combination.

In [149], a method for the automatic interpretation of forward-looking sonar images is presented. Forward-looking sonars are high-frequency sensors used at medium/long range and devoted to navigation, obstacle avoidance, and general monitoring purposes. In this paper, such a sonar is used to identify four different kinds of objects, i.e., divers, pier legs, anchor chains, and a jetty. A pattern recognition approach is proposed that utilizes a set of features, like geometric shape, size, and gray level, to qualitatively classify the aforesaid targets. These features make it possible to classify the objects despite the sonar characteristics and attitude (hence, the object appearance) and noise, by utilizing an appropriate thresholding of the feature values. In summary, the method consists in: 1) applying a median filter to reduce speckle; 2) discriminating between background and interesting (object) pixels by using a different median filter; 3) (binary) segmenting the image by using the gray-level differences between the images produced at the previous two steps; 4) applying morphological erosion/dilation operators [7] to remove spurious pixels; 5) computing features from the resulting observed patches; and 6) classifying the patches (a minimal sketch of steps 1)-4) is given below). The set of features is chosen by principal component analysis (PCA) [150]: such a technique is used to select the set of features providing the best discrimination among the classes present (unsupervised approach). The actual classification stage is then performed by comparing the qualitative features (derived from the numerical values) with the corresponding feature sets derived from exemplars.
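A minimal sketch of steps 1)-4) of this pipeline, using standard image processing tools; the window sizes and the threshold rule are illustrative choices, not the published settings.

```python
import numpy as np
from scipy import ndimage

def detect_patches(img):
    """Sketch of the detection steps of [149] on a float image."""
    smooth = ndimage.median_filter(img, size=3)        # 1) speckle reduction
    background = ndimage.median_filter(img, size=15)   # 2) large-scale background
    diff = smooth - background                         # 3) gray-level difference
    binary = diff > 2.0 * diff.std()                   #    binary segmentation
    binary = ndimage.binary_opening(binary)            # 4) morphological cleanup
    labels, n = ndimage.label(binary)                  # patches for steps 5)-6)
    return labels, n
```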

In [151] and [152], the same problem concerning the interpretation of forward-looking sonar images is considered but, unlike in the previous system, the main characteristic here is the exploitation of temporal features extracted directly from a sequence of images, which allows one to discriminate among the backscattered echoes. The Cartesian images derived from the sonar returns are first filtered and segmented, using the combination of a thresholding procedure, a median filtering, and a region-growing technique, to separate the background from the useful information. After the identification of the significant regions, shape descriptors (e.g., length, area, axes, etc.) and topological features (e.g., mean contrast, mean value, variance, etc.) are extracted from a single image and tracked along the sequence. Temporal features are extracted directly from the behavior of the static ones, using statistical measures. Finally, supervised classification is performed by using linear discriminant functions (derived from a representative training set), which are applied to classify the different kinds of entities (e.g., divers, chains, rudders) present in the scene. Following the same approach, a supervised classification procedure based on the variations in interframe feature measures is proposed in [153] to identify divers and underwater vehicles in a sequence of sector-scan sonar images. After a learning phase with noise-free data, the devised procedure includes filtering, segmentation, feature extraction and selection, and final classification. Since the segmentation results affect the classification performances, a sensitivity analysis is also performed by varying the segmentation threshold. The main result of this work lies in the robustness of the proposed method, thanks to the careful selection of the features, even in the presence of highly degrading noise. Actually, the classification accuracy is also considerable, reaching a mean value of 97.4% (averaged by varying the segmentation threshold, at a given image SNR) and a worst value of 95.9% (for a fixed segmentation threshold and image SNR).

The works in [154] and [155] deal with the recognition and attitude estimation of objects in 2-D acoustic images acquired with a forward-looking sonar or an acoustic camera. The regularity of man-made objects is exploited here; by regularity we mean that an object can be approximated by geometric surfaces bounded by straight lines, owing to the use of sensors operating at a higher frequency and a shorter range, which allow the acquisition of images from which geometric features can be reliably extracted. The technique is able to recognize an object and, at the same time, to estimate its 2-D pose, and is based on a simple voting approach applied directly to the edge discontinuities in an underwater acoustic image. Two processing levels of increasing complexity are used to detect significant image features (e.g., straight lines) with different accuracies. A database of object models is generated offline, taking into account several 2-D poses dependent on the different object projections. The first level aims at detecting the presence of an object and its rough pose in a very fast way, by simply computing the histogram of the edge orientations (see the sketch below). This profile is obtained by summing up the votes received from edge points with similar gradient orientations and by looking for significant peaks, i.e., orientations common to several edge points in the image. Robustness is obtained by preliminarily storing in the model database information about the angular relations between adjacent object-model sides, which must be considered during the histogram analysis in order to avoid false peaks generated by noise, background points, or other boundary points belonging to different objects.
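A minimal sketch of this first-level voting; the bin count and the edge-strength threshold are illustrative parameters.

```python
import numpy as np
from scipy import ndimage

def edge_orientation_histogram(img, n_bins=36, grad_thresh=0.1):
    """Each strong edge point votes for its gradient orientation;
    peaks in the histogram suggest the dominant directions of an
    object's straight sides."""
    gx = ndimage.sobel(img, axis=1)
    gy = ndimage.sobel(img, axis=0)
    mag = np.hypot(gx, gy)
    edges = mag > grad_thresh * mag.max()             # significant edges only
    theta = np.arctan2(gy[edges], gx[edges]) % np.pi  # orientation mod 180 deg
    hist, _ = np.histogram(theta, bins=n_bins, range=(0.0, np.pi))
    return hist
```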
The second level is computationally more complex, as it allows one to obtain both a more accurate pose estimate and the recognition of an object. The angles between the boundary segments of a 2-D object model are used to focus attention on the significant parts of the histogram containing the votes coming from the segments of the object; then, the boundary segments of the 2-D object model are matched with the 2-D straight segments extracted from the original image to provide a complete recognition. The proposed method proved to be intrinsically robust to noise and invariant to translation and scale, thus allowing the discrimination between different man-made objects and the detection of their orientations. Fig. 24 shows three synthetic acoustic images containing three objects, together with the related segmentation results. Quantitative results on synthetic images prove that the orientation accuracy of the method is about 1.4° on average when the image SNR is 15 dB, and 3° on average when the image SNR is 10 dB. For smaller SNR values, the method does not succeed in detecting all the object boundaries; therefore, recognition fails, but the mean error value for the detected contours is still small, being equal to 4°.

Markov random fields using deformable templates are also utilized to discriminate between man-made objects of given shapes and natural objects [156]. A high-resolution acoustic image is first segmented by a probabilistic algorithm into reverberation and shadow regions, thus becoming a binary image. The basic idea is to exploit the regularity of man-made objects by trying to fit a parameterized (i.e., deformable) model to the segmented image, looking for shadow figures similar to the adopted model. Shape transformations in terms of scaling, stretching, rotation, and skewness are allowed owing to the model parameterization, and simple geometric figures (rectangle, circle) are used. A typical MAP estimation is performed by devising an energy function that includes an edge-based term, which penalizes the distance of the image edges from the current configuration of the template, and a region-based term favoring the homogeneity of the area inside the template. The energy minimization is carried out by a genetic approach, and thresholds are heuristically set to check the final energy value, indicating whether the algorithm has succeeded in locating the model or not. The latter approach constitutes a statistical classification method, an alternative to the aforesaid methodologies, which are mainly derived from straightforward pattern recognition procedures based on the estimation of several kinds of features (geometric, topological, temporal).

2) Environment Modeling: In the context of autonomous underwater vehicle (AUV) applications, scene reconstruction, namely, environment modeling, allows an operator to drive a vehicle in a faster and more reliable way. Reconstruction can be performed by progressively updating an available map on the basis of the actual data, or by tracking objects or features in an image sequence, assuming the geometry of the sensor system to be known. Another associated problem is the estimation of the position of a vehicle, which can be performed by registering an available (historical) map with the actual sonar data or by estimating the motion parameters from an image sequence.

Fig. 24. Three examples of forward-looking sonar images with targets and the related segmentation.

One of the first attempts is described in [157], in which a knowledge-based (KB) system for processing sonar data and managing the related information flows of an AUV is presented. The system is based on a blackboard architecture [158] guiding and supervising the activities of all the vehicle modules (sensory and control), and is devoted to maintaining a 3-D (graphical) representation of the surrounding environment. The acoustic-data interpretation module is used to generate object hypotheses from forward-looking sonar images. First, a segmentation phase (threshold-based) is performed to identify the shadow, target (echo), and background areas. This phase is structured in such a way as to discard the clutter due to noise and multiple reflections, thanks to the application of a moving average filter and the estimation of local dynamic thresholds. Second, geometric and grey-level features (area, centroid, moments, mean value, variance, contrast, etc.) are computed for each region and compared with (previously learned) exemplars for the final classification. The KB system manages the various phases in order to identify and recognize the objects of interest, and also estimates their degrees of reliability. After recognition, such objects can be included in the graphical domain representation and displayed to the operator.

In [159], a spatial mapping system for scene interpretation is described. Multiresolution depth maps, integrated with an a priori model, constitute the world model to be updated on the basis of the acoustic data acquired by downward- and forward-looking sonars. In particular, quadtree representations are used to store the depth maps at different resolutions; an a priori depth map from a previous survey and an additional confidence map are utilized to discriminate between actual sonar returns and clutter. The world model is updated on the basis of the newly acquired data in three ways: the new data may confirm the knowledge already contained in the model (increasing its confidence), the new data may present conflicting information (decreasing its confidence), or the new data are insignificant and no updating is performed. For the bathymetric sonar, the set of neighborhood pixels to be updated is determined on the basis of the current vehicle location, the beam width, and the bottom distance, and the related confidence is updated in accordance with the actual new measures. If the confidence is below a predefined threshold, the old depth value is replaced with the new measure. The forward-looking (obstacle avoidance) sonar data require a more complex procedure for map updating: in this case, the depth map is dynamically updated, depending on the presence or absence of obstacles. This approach presents interesting and computationally efficient solutions to the problems of autonomous vehicle navigation and dynamic world modeling.

In [160] and [161], the registration of a local high-resolution depth map with a reference bathymetric map is addressed, aimed at autonomous vehicle navigation and positioning. The method consists of several steps. First, steep gradient contours are extracted by using the Laplacian of Gaussian [7], high-curvature points are detected in both maps, and point matches between the two maps are searched for the registration.
Matching is performed by computing a feature vector, including the gradient magnitude, the depth, and the Gaussian curvature, for each critical point in the two maps, and by calculating the Mahalanobis distance for each couple of feature vectors. In practice, this measure is a Euclidean distance between vectors, weighted with the covariance of the feature estimates. Matching is successful for the point correspondences whose distances are less than a properly chosen threshold.
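A minimal sketch of such Mahalanobis-based matching; the greedy nearest-neighbor pairing and the threshold value are illustrative simplifications.

```python
import numpy as np

def mahalanobis(u, v, cov):
    """Mahalanobis distance between two feature vectors (e.g., gradient
    magnitude, depth, Gaussian curvature), given the covariance of the
    feature estimates."""
    d = u - v
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def match_points(feats_a, feats_b, cov, thresh=2.0):
    """Pair each critical point of map A with its nearest point of
    map B in the Mahalanobis sense, keeping only pairs below thresh."""
    matches = []
    for i, fa in enumerate(feats_a):
        dists = [mahalanobis(fa, fb, cov) for fb in feats_b]
        j = int(np.argmin(dists))
        if dists[j] < thresh:
            matches.append((i, j, dists[j]))
    return matches
```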

Finally, the registration of the maps is obtained by using an extended Kalman filter, looking for the rotation angle and the translation vector that minimize the cumulative error over all the pairs of matching points. Morphological multiscale analysis by partial differential equations is used to reduce noise and artifacts while preserving the critical-point information, thus improving the reliability of the extracted points. The approach is interesting, even though the map scale (i.e., resolution) is not taken into account, which limits the applicability of the method. Based on this work, a faster method was recently proposed in [162]. It aims at matching irregular one-dimensional profiles with a regular reference elevation map for vehicle positioning and navigation. Without searching for discriminating features in the maps, this technique exploits only the 1-D profile measured with an echosounder and proposes two matching criteria, based on a statistical and a fuzzy approach, respectively. The former involves the assumption of a normal distribution for each individual measure and the estimation of the Mahalanobis distance between the measured elevation profile and the corresponding profile in the reference map. The latter assumes a heuristic uncertainty volume associated with each measure, and the intersection of fuzzy sets is performed to find the best match. In both cases, tracking is adopted to improve and speed up the matching, thus reducing the search space. The results appear notable, as the average accuracy of the estimation remains below the map resolution, in accordance with the navigation task: in the case of statistical matching, 83 m and 78 m in the x and y directions, respectively, and, in the case of fuzzy matching, 64 m in both directions, when the map cell is 100 m in both directions.

Vehicle positioning and navigation are also addressed in [163]. Forward-scan sonar data are utilized directly, by using computer vision techniques (i.e., optical versus acoustic flow) to estimate the position and motion parameters. The basic idea is to estimate the motion of a vehicle by computing the acoustic flow in a manner similar to that adopted for the optical flow in the case of optical images. A set of preprocessing steps is first applied to the acoustic data in order to remove spurious information and smooth away the noise. Then, the acoustic flow is estimated as a function of the positional parameters, by assuming a set of simplifying hypotheses. The results are promising and agree with the navigational data provided by inertial sensors. Some difficulties have been encountered, due to the very noisy nature of acoustic data. Nevertheless, the approach is interesting, thanks to the introduction of the new concept of acoustic flow, which deserves further investigation.

The tracking problem is another interesting issue, mainly faced in the context of underwater vehicle applications, in particular for navigation, obstacle avoidance, path planning, and environment reconstruction [78]. Assuming the geometry of the sensor system to be known, tracking objects in an image sequence allows both the reconstruction of the observed scene and the prediction of the target trajectories, hence providing useful support to a vehicle operator.
For example, the problem of the correspondence of target objects in a pair of sidescan sonar images is faced in [164] by a procedure starting from a three-class segmentation (shadow, echo, and reverberation), so that the interesting objects are identified as areas containing the echo and shadow classes; then, discriminative features (e.g., elongation and variance) are extracted from each selected region, and a (multi)hypothesis reasoning method (decision tree) using a distance metric is applied to find the correct matches. A different work is presented in [165], concerning the tracking of objects (typically divers) in sector-scan images. The method is based on the utilization of optical flow theory [7] and delayed decision making, using a tracking tree constructed to trace multiple hypotheses about the possible correspondences. The method appears robust to noise, fast (one to four frames/s), and quite precise, providing an accuracy of 5% at worst over a range of 10 m.

3) Visualization and Augmented Image Representation: Human rather than machine interpretation is the goal of visualization techniques, which are able to process a huge amount of data in an efficient, more concise, and understandable way. A previous survey of the visualization techniques used in underwater applications is reported in [166]. All kinds of data (echosounder, sidescan, forward-scan) can be processed by using methods derived from computer vision and computer graphics. These techniques are largely usable in such underwater tasks, as they make it possible to recognize objects before synthetically visualizing them, or to display sonar data while suitably taking into account the sensor resolution and ambiguity, the position and attitude errors, and the processing uncertainty. Similarly to the vision approach, scientific visualization techniques can also be used to manipulate underwater acoustic data. Volumetric rendering, together with an adequate preprocessing using multiple sonar returns, can be utilized to build a 3-D numerical model from which it is possible to extract other kinds of information, like hull profiles [167] and contour maps. Sidescan sonar data, too, can be processed to build more understandable images, for instance, by showing the operator a pair of stereo images with some texture mapped onto them. Multiscale, mosaicking, and contour-matching algorithms can be used to reconstruct, register, and visualize low-resolution acoustic data. AUV kinematic and dynamic modeling is an equally important area, useful for simulating the behavior of underwater vehicles prior to their actual development and for the training of operators. In essence, modeling virtual missions is becoming more and more important: the goal is to achieve real-time performances for both vehicle control and the interpretation of actual multisensory data. For these reasons, the fusion and interpretation of data are useful to facilitate the tasks of human operators and to increase the vehicle efficiency, thus reducing the operational costs.

3) Visualization and Augmented Image Representation: Human rather than machine interpretation is the goal of visualization techniques, which aim to present a huge amount of data in an efficient, concise, and understandable way. A survey of visualization techniques used in underwater applications is reported in [166]. All kinds of data (echo-sounder, side-scan, forward-scan) can be processed by using methods derived from computer vision and computer graphics. These techniques are widely applicable to underwater tasks, as they make it possible to recognize objects before synthetically visualizing them, or to display sonar data while suitably taking into account sensor resolution and ambiguity, position and attitude errors, and processing uncertainty.

Similar to the vision approach, scientific visualization techniques can also be used to manipulate underwater acoustic data. Volumetric rendering, together with adequate preprocessing using multiple sonar returns, can be utilized to build a 3-D numerical model from which other kinds of information can be extracted, like hull profiles [167] and contour maps. Sidescan sonar data can also be processed to build more understandable images, for instance, by showing the operator a pair of stereo images with some texture mapped onto them. Multiscale, mosaicking, and contour-matching algorithms can be used to reconstruct, register, and visualize low-resolution acoustic data.

AUV kinematic and dynamic modeling is an equally important area, useful for simulating the behavior of underwater vehicles prior to their actual development and for the training of operators. In essence, modeling virtual missions is becoming more and more important: the goal is to achieve real-time performance for both vehicle control and the interpretation of actual multisensory data. For these reasons, the fusion and interpretation of data are useful to facilitate the tasks of human operators and to increase vehicle efficiency, thus reducing operational costs.

In [168] and [169], a skeletonization method is proposed to identify, recognize, and visualize objects in 3-D acoustic images. The tackled application consists in the synthetic visualization of recognized objects superimposed onto real data (augmented reality), in order to improve the scene understanding of the human operator of an underwater vehicle. The considered objects are parts of an offshore rig composed of several tubes intersecting one another in different ways. The method operates directly on the raw acoustic images, adequately filtered to remove clutter and noise. The algorithm consists in the iterative application of a 3-D local operator that contracts the set of points into thin structures, thus identifying the object skeletons: the operator shifts points lying on a border toward the inner part of the set, while keeping the points inside the set almost static. After this phase, the skeletons, seen as the contracted original set of points, can easily be identified, and a binary segmentation of such points into branches and joints is performed. Then, once the different regions associated with the branches have been identified, an inertia tensor is estimated for each region to assess whether it belongs to a cylindrical shape (tube) or not. Finally, when tubes are identified, the same technique (inertia tensors) allows one to estimate the axis direction and the radius, so that a synthetic representation (VRML models) can be rendered. In Fig. 25, a simple example is given that proves the validity of the approach and the accuracy of the virtual representation: Fig. 25(a) shows the original set of points together with the extracted skeleton, and Fig. 25(b) shows the result of the whole procedure, with the estimated cylinders correctly superimposed upon the actual data.

Fig. 25. (a) Original set of 3-D points and related skeletons. (b) Augmented reality image in which the identified cylinders are superimposed on the original 3-D point set.
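The inertia-tensor test described above can be sketched as follows: the eigenstructure of the second-moment tensor of a branch's points reveals whether the branch is tube-like and, if so, yields the axis direction and a radius estimate. The elongation threshold and the radius formula below are assumptions for illustration; the exact criteria of [168] and [169] may differ.

```python
import numpy as np

def fit_cylinder(points, elongation_min=3.0):
    """points: (N, 3) array of the 3-D points of one branch region.
    Returns (axis, radius, centroid) if the branch looks cylindrical,
    or None otherwise."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    # Second-moment (inertia-like) tensor of the point distribution.
    tensor = centered.T @ centered / len(points)
    evals, evecs = np.linalg.eigh(tensor)   # eigenvalues in ascending order
    axis = evecs[:, -1]                     # dominant (axis) direction
    if evals[-1] < elongation_min * evals[-2]:
        return None                         # not elongated enough: no tube
    # Radius estimate: RMS distance of the points from the fitted axis.
    along = centered @ axis
    radial = centered - np.outer(along, axis)
    radius = float(np.sqrt((radial ** 2).sum(axis=1).mean()))
    return axis, radius, centroid
```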
Fusion of multisensory data for augmented image visualization is addressed in [170]. Acoustic and optical images are processed and cooperatively utilized to obtain a virtual representation of an observed scene. Although the acoustic data are not used alone, this approach is worth mentioning, as it exploits optical features and a calibration procedure to segment and visualize acoustic data. The method aims at the understanding and virtual/augmented visualization of underwater scenes for vehicle navigation. The idea is to calibrate the two sensors (i.e., an optical and an acoustic camera) together, in order to locate them in a common 3-D reference system. Without detailing the complete procedure, it suffices to know that both sensors are calibrated with respect to the object they are looking at, without the design of special calibration objects, as typically required. In this way, the pose (i.e., 3-D position and orientation) of each camera is known; hence, the roto-translation matrix transforming data from one reference system to the other can easily be applied to project the acoustic 3-D points onto the optical image plane. Therefore, by extracting object edges from the optical image and by mapping the 3-D points into it, reliable (target) points can be identified, while those belonging to the background are discarded. Finally, the scene objects can be visualized as a 3-D representation by associating the correct depth with the optical image texture, for instance, through a Delaunay triangulation. A typical example is given in Fig. 26: the 3-D acoustic points projected onto the optical image are shown in Fig. 26(a), together with the extracted object edges (oil-rig pipes); in Fig. 26(b), the result of the Delaunay triangulation over the points belonging to the structure is displayed; and, in Fig. 26(c), the related 3-D representation is depicted.

Fig. 26. (a) Optical image in which edges bounding the tubes are extracted and 3-D points derived from sonar returns are projected into the image plane. (b) Delaunay triangulation of the 3-D points. (c) Related 3-D representation.
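The projection step just described amounts to a rigid transformation followed by a pinhole projection. A minimal sketch is given below; the intrinsic matrix K and all names are assumptions for illustration, since [170] obtains the roto-translation from its calibration procedure but does not prescribe this exact parameterization.

```python
import numpy as np

def project_points(points_acoustic, R, t, K):
    """points_acoustic: (N, 3) 3-D points in the acoustic-camera frame.
    R: (3, 3) rotation and t: (3,) translation to the optical frame,
    as provided by calibration. K: (3, 3) camera intrinsic matrix.
    Returns (N, 2) pixel coordinates on the optical image plane."""
    cam = points_acoustic @ R.T + t     # acoustic -> optical frame
    uvw = cam @ K.T                     # pinhole projection (homogeneous)
    return uvw[:, :2] / uvw[:, 2:3]     # normalize to pixel coordinates

# Points whose projections fall inside the extracted object edges are
# kept as target points; the others are discarded as background.
```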

IX. CONCLUSION

In this paper, a survey of the methodologies used for 3-D acoustic image formation and processing has been reported. The first part of the paper has described the several approaches to acoustic image formation, outlining their characteristics and significant properties. First, a unified model of the acoustic image formation process has been proposed; second, lens-based, holographic, and beamforming approaches have been detailed, together with their main features and comparative analyses, and available systems and prototypes have been described. After a brief review of 3-D acoustic imaging in medical applications, the second part of the paper has focused on image processing techniques for segmentation, reconstruction, and scene understanding in general. Despite the vast literature covering this subject, no unified approach is commonly recognized as a standard one. A filtering stage is the basis for each successive processing phase, as image quality is usually too poor for human visual understanding and high-/medium-level processing. Typically, noise effects and sidelobe interferences are reduced by using mask filters (e.g., median ones) or by applying statistical restoration algorithms, trying to reach a tradeoff between computational complexity and the increase in image quality. Furthermore, a few methods aim at improving quality by directly affecting the image formation process, leading to the so-called near-sensor techniques. Segmentation and reconstruction are addressed by statistical, geometric, and a few other approaches. Among these, probabilistic methodologies (e.g., MRFs) have proved to be the most widely used and the most effective for acoustic image processing: being based on a rigorous modeling of the noise and of the physical image formation process, they achieve better results than other (deterministic) techniques, although at a higher cost.

Higher level image understanding is mainly application-oriented; hence, it is addressed by different techniques, depending on the actual goal to be attained. Object recognition and classification, environment modeling, and information visualization have been the problems faced in this paper, as they typically arise in underwater vehicle applications. A large number of works have been quoted and analyzed, trying to identify their significant characteristics and analogies and providing qualitative evaluations and quantitative estimates where possible, thus giving a complete and systematic view of the current state of the art of three-dimensional acoustic image formation and data processing techniques for scene interpretation in underwater applications.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers who, with their comments, have improved the quality and the readability of this paper. They are also grateful to R. K. Hansen and R. Jakobsen (Omnitech A/S, Norway) and to K. M. Houston (Draper Laboratory, USA), who kindly provided the real data used in some experiments presented in this paper.

REFERENCES

[1] J. S. Jaffe, Computer modeling and the design of optical underwater imaging systems, IEEE J. Oceanic Eng., vol. 15, no. 2, pp. –.
[2] F. M. Caimi, B. C. Bailey, and J. H. Blatt, Undersea object detection and recognition: The use of spatially and temporally varying coherent illumination, in IEEE Oceans 99, Seattle, WA, Sept. 1999, pp. –.
[3] F. M. Caimi and D. M. Kocak, Undersea imaging advances, Sea Technol., pp. –, Aug.
[4] J. S. Jaffe, K. D. Moore, D. Zawada, B. I. Ochoa, and E. Zege, Underwater optical imaging: New hardware & software, Sea Technol., pp. –, July.
[5] B. Kamgar-Parsi, B. Johnson, D. L. Folds, and E. O. Belcher, High-resolution underwater acoustic imaging with lens-based systems, Int. J. Imaging Syst. Technol., vol. 8, pp. –.
[6] R. K. Hansen and P. A. Andersen, The application of real time 3D acoustical imaging, in IEEE/OES Int. Conf. Oceans 98, Nice, France, Sept. 1998, pp. –.
[7] D. Ballard and D. Brown, Computer Vision. Englewood Cliffs, NJ: Prentice-Hall.
[8] J. L. Sutton, Underwater acoustic imaging, Proc. IEEE, vol. 67, pp. –, Apr.
[9] P. N. Keating, T. Sawatari, and G. Zilinskas, Signal processing in acoustic imaging, Proc. IEEE, vol. 67, pp. –, Apr.
[10] R. J. Urick, Principles of Underwater Sound, 3rd ed. New York: McGraw-Hill.
[11] Z. H. Cho, J. P. Jones, and M. Singh, Foundations of Medical Imaging. New York: Wiley.
[12] O. George and R. Bahl, Simulation of backscattering of high frequency sound from complex objects and sand sea-bottom, IEEE J. Oceanic Eng., vol. 20, pp. –, Apr.
[13] T. L. Henderson and S. G. Lacker, Seafloor profiling by a wideband sonar: Simulation, frequency-response, optimization, and results of a brief sea test, IEEE J. Oceanic Eng., vol. 14, pp. –, Jan.
