Transformation Invariance in Pattern Recognition: Tangent Distance and Propagation

Patrice Y. Simard,1 Yann A. Le Cun,2 John S. Denker,2 Bernard Victorri3

1 Microsoft Research, 1 Microsoft Way, Redmond, WA. E-mail: patrice@microsoft.com
2 AT&T Labs, 100 Schulz Dr., Red Bank, NJ. E-mail (Y.A.L.): yann@research.att.com; E-mail (J.S.D.): jsd@research.att.com
3 LATTICE-CNRS, ENS Paris, France. E-mail: victorri@ens.fr

Correspondence to: P. Simard. The majority of this work was completed at AT&T Labs.

ABSTRACT: In pattern recognition, statistical modeling, or regression, the amount of data is a critical factor affecting performance. If the amount of data and computational resources are unlimited, even trivial algorithms will converge to the optimal solution. In the practical case of limited data and other resources, however, satisfactory performance requires sophisticated methods that regularize the problem by introducing a priori knowledge. Invariance of the output with respect to certain transformations of the input is a typical example of such a priori knowledge. We introduce the concept of tangent vectors, which compactly represent the essence of these transformation invariances, and two classes of algorithms, tangent distance and tangent propagation, which make use of these invariances to improve performance. © 2000 John Wiley & Sons, Inc. Int J Imaging Syst Technol, 11, 2000

I. INTRODUCTION

Pattern recognition is one of the main tasks of biological information processing systems and a major challenge of computer science. The problem of pattern recognition is to classify objects into categories, given that objects in a particular category may have widely varying features and objects in different categories may have quite similar features. A typical example is handwritten digit recognition. Characters, typically represented as fixed-size images (e.g., 16 × 16 pixels), must be classified into 1 of 10 categories using a classification function. Building such a classification function is a major technological challenge, as irrelevant variabilities among objects of the same class must be eliminated and meaningful differences between objects of different classes must be identified. For most real pattern recognition tasks, these classification functions are too complicated to be synthesized by hand using only what humans know about the task. Instead, we use sophisticated techniques that combine human a priori knowledge with information automatically extracted from a set of labeled examples (the training set). These techniques can be divided into two camps, according to the number of parameters they require: the memory-based algorithms, which in effect store a sizeable subset of the entire training set, and the learned-function techniques, which learn by adjusting a comparatively small number of parameters. This distinction is arbitrary, because the patterns stored by a memory-based algorithm can be considered the parameters of a very complex learned function. The distinction is, however, useful in this work: memory-based algorithms often rely on a metric that can be modified to incorporate transformation invariances, whereas the learned-function algorithms consist of selecting a classification function whose derivatives can be constrained to reflect the same transformation invariances. The two methods for incorporating invariances are different enough to justify two independent sections.

A. Memory-Based Algorithms.
To compute the classification function, many practical pattern recognition systems and several biological models simply store all the examples, together with their labels, in a memory. Each incoming pattern can then be compared with all the stored prototypes; the labels associated with the prototypes that best match the input determine the output. This is the simplest example of a memory-based model. Memory-based models require three things: a distance measure to compare inputs to prototypes, an output function to produce an output by combining the labels of the prototypes, and a storage scheme to build the set of prototypes. All three aspects have been abundantly treated in the literature. Output functions range from simply voting the labels associated with the k closest prototypes (k-nearest neighbors) to computing a score for each class as a linear combination of the distances to all the prototypes, using fixed (Parzen, 1962) or learned (Broomhead and Lowe, 1988) coefficients. Storage schemes vary from storing the entire training set and picking appropriate subsets of it (Dasarathy, 1991) to storing learned patterns, as in learning vector quantization (LVQ; Kohonen, 1984). Distance measures can be as simple as the Euclidean distance, assuming the patterns and prototypes are represented as vectors, or more complex, as in the generalized quadratic metric (Fukunaga and Flick, 1984) or in elastic matching methods (Hinton et al., 1992). A simple but inefficient pattern recognition method is to use a simple distance measure, such as the Euclidean distance between vectors representing the raw input, combined with a large set of prototypes.
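As a concrete reference point, this simplest memory-based scheme (Euclidean distance plus k-nearest-neighbor voting) fits in a few lines. The following NumPy sketch is illustrative only; it assumes patterns are stored as flattened vectors and labels as an integer array.

```python
import numpy as np

def knn_classify(x, prototypes, labels, k=3):
    """Vote among the k prototypes closest to x in Euclidean distance.

    prototypes : array of shape (N, n), one flattened pattern per row
    labels     : integer array of shape (N,)
    """
    d = np.sum((prototypes - x) ** 2, axis=1)  # squared Euclidean distances
    nearest = np.argsort(d)[:k]                # indices of the k best matches
    return np.argmax(np.bincount(labels[nearest]))
```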

This method is inefficient because almost all possible instances of a category must be present in the prototype set. In the case of handwritten digit recognition, this means that digits of each class in all possible positions, sizes, angles, writing styles, line thicknesses, and skews must be stored. In real situations, this approach leads to impractically large prototype sets or to mediocre recognition accuracy (Fig. 1): an unlabeled image of a thick, slanted 9 must be classified by finding the closest prototype image from two images representing, respectively, a thin, upright 9 and a thick, slanted 4. According to the Euclidean distance (the sum of the squares of the pixel-to-pixel differences), the 4 is closer. The result is an incorrect classification.

Figure 1. According to the Euclidean distance, the pattern to be classified is more similar to prototype B. A better distance measure would find that prototype A is closer, because it differs mainly by a rotation and a thickness transformation, two transformations that should leave the classification invariant.

The classical way of dealing with this problem is to use a so-called feature extractor, whose purpose is to compute a representation of the patterns that is minimally affected by transformations of the patterns that do not modify their category. For character recognition, the representation should be invariant with respect to position, size changes, slight rotations, distortions, or changes in line thickness. The design and implementation of feature extractors is the major bottleneck of building a pattern recognition system. For example, the problem illustrated in Figure 1 can be solved by deslanting and thinning the images.

An alternative is to use an invariant distance measure, constructed in such a way that the distance between a prototype and a pattern is not affected by irrelevant transformations of the pattern or of the prototype. With an invariant distance measure, each prototype can match many possible instances of a pattern, thereby greatly reducing the number of prototypes required. The natural way of doing this is to use deformable prototypes: during the matching process, each prototype is deformed so as to best fit the incoming pattern, and the quality of the fit, possibly combined with a measure of the amount of deformation, is then used as the distance measure (Hinton et al., 1992). With the example of Figure 1, the 9 prototype would be rotated and thickened so as to best match the incoming 9. This approach has two shortcomings. First, a set of allowed deformations must be designed based on a priori knowledge; fortunately, this is feasible for many tasks, including character recognition. Second, the search for the best-matching deformation is often enormously expensive and/or unreliable.

Consider the case of patterns that can be represented by vectors. For example, the pixel values of a 16 × 16 pixel character image can be viewed as the components of a 256-dimensional (256-D) vector. One pattern, or one prototype, is a point in this 256-D space. Assuming that the set of allowable transformations is continuous, the set of all the patterns that can be obtained by transforming one prototype using one or a combination of allowable transformations is a surface in the 256-D pixel space. More precisely, when a pattern P is transformed (e.g., rotated) according to a transformation s(P, α) that depends on one parameter α (e.g., the angle of the rotation), the set of all the transformed patterns

S_P = \{ x \mid \exists \alpha \text{ for which } x = s(P, \alpha) \} \quad (1)

is a 1-D curve in the vector space of the inputs.
In the remainder of this study, we will always assume that we have chosen s to be differentiable with respect to both P and α, and such that s(P, 0) = P. When the set of transformations is parameterized by n parameters α_i, the intrinsic dimension of the manifold S_P is n. For example, if the allowable transformations of character images are horizontal and vertical shifts, rotations, and scaling, the surface will be a 4-D manifold. In general, the manifold will not be linear. Even a simple image translation corresponds to a highly nonlinear transformation in the high-dimensional pixel space: if the image of an 8 is translated upward, some pixels oscillate from white to black and back several times. Matching a deformable prototype to an incoming pattern now amounts to finding the point on the surface that is at a minimum distance from the point representing the incoming pattern. This nonlinearity makes the matching much more expensive and unreliable. Simple minimization methods such as gradient descent (or conjugate gradient) can be used to find the minimum-distance point; however, these methods only converge to a local minimum, and running such an iterative procedure for each prototype is usually prohibitively expensive. If the set of transformations happens to be linear in pixel space, then the manifold is a linear subspace (a hyperplane). The matching procedure is then reduced to finding the shortest distance between a point (vector) and a hyperplane, an easy-to-solve quadratic minimization problem. This special case has been studied and is sometimes referred to as Procrustes analysis (Sibson, 1978). It has been applied to signature verification (Hastie et al., 1991) and on-line character recognition (Sinden and Wilfong, 1992). This study considers the more general case of nonlinear transformations, such as geometric transformations of gray-level images. Remember that even a simple image translation corresponds to a highly nonlinear transformation in the high-dimensional pixel space. The main idea of this study is to approximate the surface of possible transforms of a pattern by its tangent plane at the pattern, thereby reducing the matching to finding the shortest distance between two planes. This distance is called the tangent distance. The result of the approximation is shown in Figure 2, in the case of rotation for handwritten digits. The theoretical curve in pixel space that represents Eq. (1), together with its linear approximation, is shown in Figure 2 (top). Points of the transformation curve are depicted below for various amounts of rotation (each angle corresponds to a value of α). Figure 2 (bottom) depicts the linear approximation of the curve s(P, α) given by the Taylor expansion of s around 0:

s(P, \alpha) = s(P, 0) + \alpha \frac{\partial s(P, \alpha)}{\partial \alpha}\Big|_{\alpha=0} + O(\alpha^2) \approx P + \alpha T \quad (2)

This linear approximation is completely characterized by the point P = s(P, 0) and the tangent vector T = \partial s(P, \alpha)/\partial \alpha |_{\alpha=0}. Tangent vectors, also called the Lie derivatives of the transformation s, will be discussed in Section IV. For reasonably small angles (α ≪ 1), the approximation is very good (Fig. 2).
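To make Eqs. (1) and (2) concrete, the following sketch estimates the rotation tangent vector T of a smoothed pattern by a finite difference and compares the true rotation s(P, α) with the linear approximation P + αT. SciPy's rotate stands in for s; the random pattern, smoothing, and angles are illustrative choices, not the paper's data.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, rotate

def s(P, alpha):
    """Transformation s(P, alpha): rotate P by alpha degrees about its center."""
    return rotate(P, alpha, reshape=False, order=3)

P = gaussian_filter(np.random.rand(16, 16), sigma=1.0)  # stand-in smoothed pattern
eps = 1.0
T = (s(P, eps) - P) / eps                 # finite-difference tangent vector
for alpha in (2.0, 5.0, 10.0):
    err = np.linalg.norm(s(P, alpha) - (P + alpha * T))
    print(f"alpha = {alpha:4.1f} deg, linear-approximation error = {err:.4f}")
```

The error grows with α, reflecting that the tangent plane is only a local approximation of the transformation curve.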

Figure 2. (Top) Representation of the effect of the rotation in pixel space. (Middle) Small rotations of an original digitized image of the digit 2, for different angle values of α. (Bottom) Images obtained by moving along the tangent to the transformation curve for the same original digitized image P, by adding various amounts (α) of the tangent vector T.

Figure 3 illustrates the difference among the Euclidean distance, the full invariant distance (minimum distance between manifolds), and the tangent distance. Both the prototype and the pattern are deformable (two-sided distance). However, for simplicity or efficiency reasons, it is also possible to deform only the prototype or only the unknown pattern (one-sided distance). Although we concentrate on using tangent distance to recognize images, the method can be applied to many different types of signals, such as temporal signals, speech, and sensor data.

Figure 3. Illustration of the Euclidean distance and the tangent distance between P and E. The curves S_P and S_E represent the sets of points obtained by applying the chosen transformations (e.g., translations and rotations) to P and E. The lines going through P and E represent the tangents to these curves. Assuming that the working space has more dimensions than the number of chosen transformations (on the diagram, assume one transformation in a 3-D space), the tangent spaces do not intersect and the tangent distance is uniquely defined.

B. Learned-Function Algorithms. Rather than trying to keep a representation of the training set, it is also possible to choose a classification function by learning a set of parameters. This is the approach taken in neural networks, curve fitting, and regression. We assume that all data are drawn independently from a given statistical distribution and that our learning machine is characterized by the set of functions it can implement, G_w(x), indexed by the vector of parameters w. We write F(x) for the correct or desired labeling of the point x. The task is to find a value for w such that G_w best approximates F. We can use a finite set of training data to help find this vector; we assume the correct labeling F(x) is known for all points in the training set. For example, G_w may be the function computed by a neural net having weights w, or G_w may be a polynomial having coefficients w. Without additional information, finding a value for w is an ill-posed problem unless the number of parameters is small and/or the size of the training set is large, because the training set does not provide enough information to distinguish the best solution among all the candidate w's (Fig. 4, left).

Figure 4. Learning a given function (solid line) from a limited set of examples (x_1 to x_4). The fitted curves are shown by a dotted line. (Left) The only constraint is that the fitted curve goes through the examples. (Right) The fitted curves go through each example, and their derivatives evaluated at the examples agree with the derivatives of the given function.

The desired function F (solid line) is to be approximated by functions G_w (dotted line) from the four examples {(x_i, F(x_i))}, i = 1, ..., 4. As exemplified in Figure 4, the fitted function G_w largely disagrees with the desired function F between the examples, but it is not possible to infer this from the training set alone. Many values of w can generate many different functions G_w, some of which may be terrible approximations of F, even though they are in complete agreement with the training set.
Because of this, it is customary to add regularizers, or additional constraints, to restrict the search for an acceptable w. For example, we may require the function G_w to be smooth by adding the constraint that \|w\|^2 should be minimized. It is important that the regularizer reflect a property of F; hence, regularizers depend on a priori knowledge about the function to be modeled. Selecting a good family of functions {G_w} is a difficult task, sometimes known as model selection (Hastie and Tibshirani, 1990; Hoerl and Kennard, 1970). If the family contains a large set of functions, it is more likely to contain a good approximation of F (the function we are trying to approximate). However, it is also more likely that the candidate selected using the training set will generalize poorly, because many functions in the family will agree with the training data yet take outrageous values between the training samples. If, on the other hand, the family contains a small set of

functions, it is more likely that a function G_w that fits the data will be a good approximation of F. The capacity of the family of functions is often referred to as the VC dimension (Vapnik, 1982; Vapnik and Chervonenkis, 1971). If a large amount of data is available, the family should be large (high VC dimension), so that more functions can be approximated, and in particular F. If, on the other hand, the data are scarce, the family should be restricted to a small set of functions (low VC dimension), to control the values between the (more distant) samples. (This point of view also applies to memory-based systems. In the case where all the training data can be kept in memory, however, the VC dimension is infinite and the formalism is meaningless; the VC dimension belongs to a learning paradigm and is not useful unless learning is involved.)

The VC dimension can also be controlled by putting a knob on how much effect is given to some regularizers. For instance, it is possible to control the capacity of a neural network by adding weight decay as a regularizer. Weight decay is a heuristic that favors smooth classification functions by making a tradeoff: decreasing \|w\|^2 at the cost, usually, of a slightly increased error on the training set. Because the optimal classification function is not necessarily smooth, for instance at a decision boundary, the weight decay regularizer can have adverse effects. As mentioned earlier, the regularizer should reflect interesting properties (a priori knowledge) of the function to be learned. If the functions F and G_w are assumed to be differentiable, which is generally the case, the search for G_w can be greatly improved by requiring that the derivatives of G_w evaluated at the points {x_i} be more or less equal (this is the regularizer knob) to the derivatives of F at the same points (Fig. 4, right). This result can be extended to multidimensional inputs, in which case we can impose the equality of the derivatives of F and G_w in certain directions, not necessarily in all directions of the input space. Such constraints find immediate use in traditional pattern recognition problems. It is often the case that a priori knowledge is available on how the desired function varies with respect to some transformations of the input; it is straightforward to derive the corresponding constraint on the directional derivatives of the fitted function G_w in the directions of the transformations (previously named tangent vectors). Typical examples can be found in pattern recognition, where the desired classification function is known to be invariant with respect to some transformations of the input, such as translation, rotation, and scaling; in other words, the directional derivatives of the classification function in the directions of these transformations are zero (Fig. 4). The right part of Figure 4 shows how the additional constraints on G_w help generalization by constraining the values of G_w outside the training set. For every transformation that has a known effect on the classification function, a regularizer can be added in the form of a constraint on the directional derivative of G_w in the direction of the tangent vector (such as the one depicted in Fig. 2), computed from the curve of transformation. Section II analyzes in detail how to use a distance based on tangent vectors in memory-based algorithms; Section III discusses the use of tangent vectors in neural networks, with the tangent propagation algorithm; and Section IV compares different algorithms to compute tangent vectors.

II. TANGENT DISTANCE

The Euclidean distance between two patterns P and E is in general not appropriate, because it is sensitive to irrelevant transformations
of P and of E. In contrast, the transformed distance 𝒟(E, P) is defined to be the minimal distance between the two manifolds S_P and S_E; it is therefore invariant with respect to the transformations used to generate S_P and S_E (Fig. 3). Unfortunately, these manifolds have no analytic expression in general, and finding the distance between them is a difficult optimization problem with multiple local minima. Besides, true invariance is not necessarily desirable, because a rotation of a 6 into a 9 does not preserve the correct classification. Our approach consists of computing the minimum distance between the linear surfaces that best approximate the nonlinear manifolds S_P and S_E. This solves three problems at once: (1) linear manifolds have simple analytic expressions that can be easily computed and stored, (2) finding the minimum distance between linear manifolds is a simple least-squares problem that can be solved efficiently, and (3) this distance is locally, not globally, invariant: the distance between a 6 and a slightly rotated 6 is small, but the distance between a 6 and a 9 is large. The different distances between P and E are represented schematically in Figure 3, which shows two patterns P and E in 3-D space. The manifolds generated by s are represented by 1-D curves going through E and P, respectively. The linear approximations to the manifolds are represented by lines tangential to the curves at E and P. These lines do not intersect in three dimensions, and the shortest distance between them (uniquely defined) is the tangent distance D(E, P). The distance 𝒟(E, P) between the two nonlinear transformation curves is also shown in Figure 3. An efficient implementation of the tangent distance D(E, P) is given in Section IIA, using image recognition as an illustration. We then compare our methods with the best-known competing methods. Finally, we discuss possible variations on the tangent distance and how it can be generalized to problems other than pattern recognition.

A. Implementation. We describe formally the computation of the tangent distance. Let the function s transform an image P to s(P, α) according to the parameter α. We require s to be differentiable with respect to α and P, and require s(P, 0) = P. For example, if P is a 2-D image, s(P, α) could be a rotation of P by the angle α. If we are interested in all transformations of images that conserve distances (isometries), s(P, α) would be a rotation by α_θ followed by a translation by (α_x, α_y) of the image P. In this case, α = (α_θ, α_x, α_y) is a vector of parameters of dimension 3. In general, α = (α_1, ..., α_m) is of dimension m. Because s is differentiable, the set S_P = {x | ∃α for which x = s(P, α)} is a differentiable manifold that can be approximated to the first order by a hyperplane T_P. This hyperplane is tangent to S_P at P and is generated by the columns of the matrix

L_P = \frac{\partial s(P, \alpha)}{\partial \alpha}\Big|_{\alpha=0} = \left[ \frac{\partial s(P, \alpha)}{\partial \alpha_1}, \ldots, \frac{\partial s(P, \alpha)}{\partial \alpha_m} \right]_{\alpha=0} \quad (3)

whose columns are vectors tangential to the manifold. If E and P are two patterns to be compared, the respective tangent planes T_E and T_P can be used to define a new distance D between these two patterns. The tangent distance D(E, P) between E and P is defined by

D(E, P) = \min_{x \in T_E, \, y \in T_P} \| x - y \|^2 \quad (4)

The equations of the tangent planes T_E and T_P are given by:

E(\alpha_E) = E + L_E \alpha_E \quad (5)

P(\alpha_P) = P + L_P \alpha_P \quad (6)

where L_E and L_P are the matrices containing the tangent vectors (Eq. 3), and the vectors \alpha_E and \alpha_P are the coordinates of E(\alpha_E) and P(\alpha_P) (using bases L_E and L_P) in the corresponding tangent planes. Note that E, E(\alpha_E), L_E, and \alpha_E denote vectors and matrices in the linear Eq. (5). For example, if the pixel space were of dimension 5 and there were two tangent vectors, we could rewrite Eq. (5) as:

\begin{pmatrix} E(\alpha)_1 \\ E(\alpha)_2 \\ E(\alpha)_3 \\ E(\alpha)_4 \\ E(\alpha)_5 \end{pmatrix} =
\begin{pmatrix} E_1 \\ E_2 \\ E_3 \\ E_4 \\ E_5 \end{pmatrix} +
\begin{pmatrix} L_{11} & L_{12} \\ L_{21} & L_{22} \\ L_{31} & L_{32} \\ L_{41} & L_{42} \\ L_{51} & L_{52} \end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} \quad (7)

The quantities L_E and L_P are attributes of the patterns, so in many cases they can be precomputed and stored. Computing the tangent distance

D(E, P) = \min_{\alpha_E, \alpha_P} \| E(\alpha_E) - P(\alpha_P) \|^2 = \min_{\alpha_E, \alpha_P} d(\alpha_E, \alpha_P) \quad (8)

where d(\alpha_E, \alpha_P) = \| E(\alpha_E) - P(\alpha_P) \|^2, amounts to solving a linear least-squares problem. In the interest of clarity, and to make the computational costs more apparent, the details of the computation of \alpha_P and \alpha_E are spelled out below (the advanced reader can skip to the next section). The optimality condition is that the partial derivatives of d(\alpha_E, \alpha_P) with respect to \alpha_P and \alpha_E should be zero:

\frac{\partial d(\alpha_E, \alpha_P)}{\partial \alpha_E} = 2 \, (E(\alpha_E) - P(\alpha_P))^\top L_E = 0 \quad (9)

\frac{\partial d(\alpha_E, \alpha_P)}{\partial \alpha_P} = 2 \, (P(\alpha_P) - E(\alpha_E))^\top L_P = 0 \quad (10)

Substituting E(\alpha_E) and P(\alpha_P) by their expressions yields the following linear system of equations, which we must solve for \alpha_P and \alpha_E:

L_P^\top (E - P + L_E \alpha_E - L_P \alpha_P) = 0 \quad (11)

L_E^\top (E - P + L_E \alpha_E - L_P \alpha_P) = 0 \quad (12)

The solution of this system is:

(L_{PE} L_{EE}^{-1} L_E^\top - L_P^\top)(E - P) = (L_{PE} L_{EE}^{-1} L_{EP} - L_{PP}) \alpha_P \quad (13)

(L_{EP} L_{PP}^{-1} L_P^\top - L_E^\top)(E - P) = (L_{EE} - L_{EP} L_{PP}^{-1} L_{PE}) \alpha_E \quad (14)

where L_{EE} = L_E^\top L_E, L_{PE} = L_P^\top L_E, L_{EP} = L_E^\top L_P, and L_{PP} = L_P^\top L_P. LU decompositions of L_{EE} and L_{PP} can be precomputed. The most expensive part in solving this system is evaluating L_{EP} (L_{PE} can be obtained by transposing L_{EP}); it requires m_E × m_P dot products, where m_E is the number of tangent vectors for E and m_P is the number of tangent vectors for P. Once L_{EP} has been computed, \alpha_P and \alpha_E can be obtained by solving two (small) linear systems of, respectively, m_E and m_P equations. The tangent distance is obtained by computing \| E(\alpha_E) - P(\alpha_P) \|^2 using the values of \alpha_P and \alpha_E in Eqs. (5) and (6). If n is the dimension of the input space (i.e., the length of vectors E and P), the algorithm described above requires roughly n(m_E + 1)(m_P + 1) + 3(m_E^3 + m_P^3) multiply-adds. Approximations to the tangent distance can, however, be computed more efficiently.

B. Some Illustrative Results. Local Invariance. The local invariance of tangent distance can be illustrated by transforming a reference image by various amounts and measuring its distance to a set of prototypes. (Local invariance refers to invariance with respect to small transformations, e.g., a rotation by a very small angle. In contrast, global invariance refers to invariance with respect to arbitrarily large transformations, e.g., a rotation of 180°. Global invariance is not desirable in digit recognition, because we need to distinguish a 6 from a 9.) Figure 5 (bottom) shows 10 typical handwritten digit images. One of them, the digit 3, is chosen to be the reference. The reference is translated horizontally by the amount indicated on the abscissa. There are 10 curves for Euclidean distance and 10 more curves for tangent distance, measuring the distance between the translated reference and each of the 10 digits. Because the reference was chosen from among the 10 digits, it is not surprising that the curve corresponding to the digit 3 goes to 0 when the reference is not translated (0-pixel translation). It is clear from Figure 5 that if the reference (the image 3) is translated by more than two pixels, the Euclidean distance will confuse it with other digits, namely 8 or 5. In contrast, there is no possible confusion when tangent distance is used; in fact, in this example the tangent distance correctly identifies the reference up to a translation of five pixels. Similar curves were obtained with all the other transformations (e.g., rotation and scaling).
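The computation spelled out in Section IIA translates almost line for line into NumPy. The sketch below solves Eqs. (13) and (14) and evaluates Eq. (8); it assumes the tangent vectors are stored as matrix columns and that the small systems are nonsingular (see "Controlled Deformation" below for the regularized variant).

```python
import numpy as np

def tangent_distance(E, P, LE, LP):
    """Two-sided tangent distance, Eqs. (5)-(14).

    E, P   : flattened patterns, shape (n,)
    LE, LP : tangent vectors as columns, shapes (n, mE) and (n, mP)
    """
    LEE, LPP = LE.T @ LE, LP.T @ LP   # small Gram matrices (precomputable)
    LEP = LE.T @ LP                   # the expensive mE x mP dot products
    LPE = LEP.T
    d = E - P
    # Eq. (13): solve for alpha_P.
    A_P = LPE @ np.linalg.solve(LEE, LEP) - LPP
    b_P = (LPE @ np.linalg.solve(LEE, LE.T) - LP.T) @ d
    alpha_P = np.linalg.solve(A_P, b_P)
    # Eq. (14): solve for alpha_E.
    A_E = LEE - LEP @ np.linalg.solve(LPP, LPE)
    b_E = (LEP @ np.linalg.solve(LPP, LP.T) - LE.T) @ d
    alpha_E = np.linalg.solve(A_E, b_E)
    diff = (E + LE @ alpha_E) - (P + LP @ alpha_P)  # E(alpha_E) - P(alpha_P)
    return np.dot(diff, diff)
```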
The local invariance of tangent distance with respect to small transformations generally implies more accurate classification for much larger transformations. This is the single most important feature of tangent distance. The locality of the invariance has another important benefit: local invariance can be enforced with very few tangent vectors. The reason is that for infinitesimal (local) transformations, there is a direct correspondence (an isomorphism, actually; see Lie algebra in Choquet-Bruhat et al., 1982) between the tangent vectors of the tangent plane and the various compositions of transformations. For example, the three tangent vectors for X-translation, Y-translation, and rotation around the origin generate a tangent plane corresponding to all the possible compositions of horizontal translations, vertical translations, and rotations. The resulting tangent distance is then locally invariant to all the translations and all the rotations (around any center). Figure 6 further illustrates this phenomenon by displaying points in the tangent plane generated from only five tangent vectors. Each of these images looks as if it had been obtained by applying various combinations of scaling, rotation, horizontal and vertical skewing, and thickening; yet the tangent distance between any of these points and the original image is 0.

Handwritten Digit Recognition. Experiments were conducted to evaluate the performance of tangent distance for handwritten digit recognition. An interesting characteristic of digit images is that we can readily identify a set of local transformations that do not affect the identity of the character, while covering a large portion of the set of possible instances of the character. Seven such image transformations were identified: X- and Y-translations, rotation, scaling, two hyperbolic transformations (which can generate shearing and

squeezing), and line thickening or thinning. The first six transformations were chosen to span the set of all possible linear coordinate transforms in the image plane. (Nevertheless, they correspond to highly nonlinear transforms in pixel space.) Additional transformations were tried with less success.

Figure 5. Euclidean and tangent distances between 10 typical images of handwritten digits and a translated image of the digit 3. The abscissa represents the amount of horizontal translation (measured in pixels).

Three databases were used to test our algorithm:

1. U.S. Postal Service (USPS) database: The database consisted of 16 × 16 pixel size-normalized images of handwritten digits taken from U.S. mail envelopes. The training and testing sets had, respectively, 9,709 and 2,007 examples.

2. NIST1 database: The second experiment was a competition organized by the National Institute of Standards and Technology (NIST) in spring 1992. The object of the competition was to classify a test set of 59,000 handwritten digits, given a training set of 223,000 patterns.

3. NIST2 database: The third experiment was performed on a database made out of the training and testing databases provided by NIST (see above). NIST had divided the data into two sets, which unfortunately had different distributions: the NIST1 training set (223,000 patterns) was easier than the testing set (59,000 patterns). In our NIST2 experiments, we combined these two sets 50/50 to make a training set of 60,000 patterns and testing and validation sets of 10,000 patterns each, all having the same characteristics.

For each of these three databases, we tried to evaluate human performance to benchmark the difficulty of the database.

Figure 6. (Left) Original image. (Middle) Five tangent vectors corresponding, respectively, to the five transformations: scaling, rotation, expansion of the X axis while compressing the Y axis, expansion of the first diagonal while compressing the second diagonal, and thickening. (Right) 32 points in the tangent space generated by adding or subtracting each of the five tangent vectors.

Table I. Performance, in percent of errors, for (in order) human, K-nearest neighbor (K-NN), tangent distance (TD), LeNet1 (simple neural network), LeNet4 (large neural network), optimal margin classifier (OMC), local learning (LL), and boosting (Boost).

        Human   K-NN   TD   LeNet1   LeNet4   OMC   LL   Boost
USPS
NIST1
NIST2

For the USPS, two members of our group went through the test set, and both obtained a 2.5% raw error performance. The human performance on NIST1 was provided by NIST. The human performance on NIST2 was measured on a small subsample of the database and must therefore be taken with caution. Several of the leading algorithms were tested on each of these databases. The first experiment used the K-nearest-neighbor algorithm with the ordinary Euclidean distance. The prototype set consisted of all available training examples. A 1-nearest-neighbor rule gave optimal performance on USPS, whereas a 3-nearest-neighbors rule performed better on NIST2. The second experiment was similar to the first, but the distance function was changed to tangent distance with seven transformations. For the USPS and NIST2 databases, the prototype set was constructed as before. For NIST1, however, it was constructed by cycling through the training set: any pattern that was misclassified was added to the prototype set. After a few cycles, no more prototypes were added (the training error was 0). This resulted in 10,000 prototypes. A 3-nearest-neighbors rule gave optimal performance on this set. Other algorithms, such as neural nets (LeCun et al., 1990, 1995), the optimal margin classifier (Cortes and Vapnik, 1995), local learning (Bottou and Vapnik, 1992), and boosting (Drucker et al., 1993), were also run on these databases; a case study can be found in LeCun et al. (1995). The results are summarized in Table I. As the table illustrates, the tangent distance algorithm equals or outperforms all the other algorithms we tested in all cases except one: boosted LeNet4 was the winner on the NIST2 database. This is not surprising: the K-nearest-neighbor algorithm (with no preprocessing) is very unsophisticated compared with local learning, the optimal margin classifier, and boosting. The advantage of tangent distance is the a priori knowledge of transformation invariance embedded in the distance. When the training set is sufficiently large, as is the case for NIST2, some of this knowledge can be picked up from the data by the more sophisticated algorithms. In other words, the value of a priori knowledge decreases as the size of the training set increases.

C. How to Make Tangent Distance Work. This section is dedicated to the technological know-how necessary to make tangent distance work in various applications. Tricks of this sort are usually not published, for various reasons (they are not always theoretically sound, page area is too valuable, the tricks are specific to one particular application, and commercial competitive considerations discourage telling everyone how to reproduce the results). However, they are often a determining factor in making the technology a success, and several of them are discussed here.

Smoothing the Input Space. This is the single most important factor in obtaining good performance with tangent distance. By definition, the tangent vectors are the Lie derivatives of the transformation function s(P, α) with respect to α. They can be written as:

L_P = \frac{\partial s(P, \alpha)}{\partial \alpha}\Big|_{\alpha=0} = \lim_{\epsilon \to 0} \frac{s(P, \epsilon) - s(P, 0)}{\epsilon} \quad (15)

It is therefore important that s be differentiable (and well behaved) with respect to α. In particular, it is clear from Eq. (15) that s(P, ε) must be computed for arbitrarily small ε.
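Anticipating the smoothing trick described next, Eq. (15) can be approximated numerically once the image has been smoothed so that sub-pixel shifts are well defined. A minimal SciPy sketch, in which the smoothing width and step size are illustrative choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def x_translation_tangent(P, sigma=1.0, eps=0.1):
    """Finite-difference estimate of the X-translation tangent vector,
    (s(P, eps) - s(P, 0)) / eps, computed on a Gaussian-smoothed image."""
    f = gaussian_filter(P.astype(float), sigma)  # make s differentiable
    f_eps = shift(f, (0.0, eps), order=3)        # s(f, eps): sub-pixel shift in x
    return (f_eps - f) / eps
```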
Fortunately, even when P can only take discrete values, it is easy to make s differentiable. The trick is to use a smoothing interpolating function C_σ as a preprocessing for P, such that s(C_σ(P), α) is differentiable (with respect to C_σ(P) and α, not with respect to P). For instance, if the input space for P consists of binary images, C_σ(P) can be the convolution of P with a Gaussian function of standard deviation σ. If s(C_σ(P), α) is a translation of α pixels, the derivative of s(C_σ(P), α) with respect to α can easily be computed, because s(C_σ(P), α) can be obtained by translating Gaussian functions. Preprocessing is discussed in more detail in Section IV. The smoothing factor σ controls the locality of the invariance: the smoother the transformation curve defined by s, the longer the linear approximation remains valid. In general, the best smoothing is the maximum smoothing that does not blur the features. For example, in handwritten character recognition with 16 × 16 pixel images, a Gaussian function with a standard deviation of one pixel yielded the best results; increased smoothing led to confusion (such as a 5 mistaken for a 6 because the lower loop had been closed by the smoothing), and decreased smoothing did not make full use of the invariance properties. If the available computation time allows it, the best strategy is to extract features first, smooth shamelessly, and then compute the tangent distance on the smoothed features.

Controlled Deformation. The linear system given in Eq. (8) is singular if some of the tangent vectors for E or P are parallel. Although the probability of this happening is zero when the data are drawn from a real-valued continuous distribution (as is the case in handwritten character recognition), it is possible for a pattern to be duplicated in both the training and the test set, resulting in a division-by-zero error. The fix is simple and elegant: Eq. (8) can be replaced by

D(E, P) = \min_{\alpha_E, \alpha_P} \left( \| E(\alpha_E) - P(\alpha_P) \|^2 + k \| L_E \alpha_E \|^2 + k \| L_P \alpha_P \|^2 \right) \quad (16)

The physical interpretation of this equation is illustrated in Figure 7: the point E(α_E) on the tangent plane T_E is attached to E by a spring of constant k and to P(α_P) (on the tangent plane T_P) by a spring of constant 1, and P(α_P) is also attached to P by a spring of constant k (all three springs have zero natural length). The new tangent distance is the total elastic potential energy stored in the three springs at equilibrium. As for the standard tangent distance, the solution can easily be obtained by differentiating Eq. (16) with respect to α_E and α_P. The differentiation yields:

L_P^\top (E - P + L_E \alpha_E - (1 + k) L_P \alpha_P) = 0 \quad (17)

L_E^\top (E - P + (1 + k) L_E \alpha_E - L_P \alpha_P) = 0 \quad (18)

Figure 7. The tangent distance between E and P is the elastic energy stored in the three springs connecting P, P(α_P), E(α_E), and E. P(α_P) and E(α_E) can move without friction along the tangent planes. The spring constants are indicated on the figure.

The solution of this system is:

(L_{PE} L_{EE}^{-1} L_E^\top - (1 + k) L_P^\top)(E - P) = (L_{PE} L_{EE}^{-1} L_{EP} - (1 + k)^2 L_{PP}) \alpha_P \quad (19)

(L_{EP} L_{PP}^{-1} L_P^\top - (1 + k) L_E^\top)(E - P) = ((1 + k)^2 L_{EE} - L_{EP} L_{PP}^{-1} L_{PE}) \alpha_E \quad (20)

where L_{EE} = L_E^\top L_E, L_{PE} = L_P^\top L_E, L_{EP} = L_E^\top L_P, and L_{PP} = L_P^\top L_P. The system has the same complexity as the vanilla tangent distance, except that it always has a solution for k > 0 and is more numerically stable. Note that in the limit cases, the system yields the standard tangent distance (k = 0) and the Euclidean distance (k → ∞). This approach is also useful when the number of tangent vectors is greater than or equal to the number of dimensions of the space: the standard tangent distance would most likely be zero (the tangent spaces intersect), but the spring tangent distance still expresses valuable information about the invariances. If the dimension of the input space is large compared with the number of tangent vectors, keeping k as small as possible is better, because a small k does not interfere with the sliding along the tangent planes (E(α_E) and P(α_P) are less constrained). Contrary to intuition, there is no danger of sliding too far in a high-dimensional space, because tangent vectors are always roughly orthogonal, and sliding far would require them to be parallel.

Hierarchy of Distances. If several invariances are used, classification using the full tangent distance alone would be quite expensive. Fortunately, if a typical memory-based algorithm such as K-nearest neighbors is used, it is unnecessary to compute the full tangent distance between the unclassified pattern and all the labeled samples. In particular, if a crude estimate of the tangent distance indicates with sufficient confidence that a sample is far from the pattern to be classified, no more computation is needed to know that this sample is not one of the K nearest neighbors. Based on this observation, one can build a hierarchy of distances that greatly reduces the computation required for each classification. Assume, for instance, that we have m approximations D_i of the tangent distance, ordered such that D_1 is the crudest approximation and D_m is exactly the tangent distance (for instance, D_1 to D_5 could be the Euclidean distance at increasing resolutions, and D_6 to D_10 could each add a tangent vector at full resolution). The basic idea is to keep a pool of all the prototypes that could potentially be among the K nearest neighbors of the unclassified pattern. Initially, the pool contains all the samples. Each distance D_i corresponds to a stage of the classification process. The classification algorithm has three steps at each stage and proceeds from stage 1 to stage m, or until the classification is complete. In step 1, the distance D_i between each sample in the pool and the unclassified pattern is computed. In step 2, a classification and a confidence score are computed from these distances; if the confidence is good enough, that is, better than C_i (e.g., if all the samples left in the pool are in the same class), the classification is complete. Otherwise, in step 3, the K_i samples closest according to distance D_i are kept in the pool, and the remaining samples are discarded. Finding the K_i closest samples can be done in O(p) operations, where p is the number of samples in the pool, because these elements need not be sorted (Aho et al., 1983; Press et al., 1988).
The reduced pool is then passed to stage i + 1. The two constants C_i and K_i must be determined in advance using a validation set. This can easily be done graphically by plotting the error as a function of K_i and C_i at each stage (starting with K_i equal to the number of labeled samples and C_i = 1 for all stages). At each stage, there is a minimum K_i and a minimum C_i that give optimal performance on the validation set; by taking larger values, we can decrease the probability of making errors on the test sets. The slightly worse performance of a hierarchy of distances is often well worth the speedup. The computational cost of classifying one pattern is then:

\text{computational cost} = \sum_i (\text{number of prototypes at stage } i) \times (\text{distance complexity at stage } i) \times (\text{probability of reaching stage } i) \quad (21)

All this is better illustrated in Figure 8. This system was used for the USPS experiment. In classification of handwritten digits (16 × 16 pixel images), D_1, D_2, and D_3 were the Euclidean distances at resolutions 2 × 2, 4 × 4, and 8 × 8, respectively. D_4 was the one-sided tangent distance with X-translation, on the sample side only, at resolution 8 × 8. D_5 was the double-sided tangent distance with X-translation at full resolution.

Figure 8. Pattern recognition using a hierarchy of distances. The filter proceeds from left (starting with the whole database) to right (where only a few prototypes remain). At each stage, distances between prototypes and the unknown pattern are computed and sorted, and the best candidate prototypes are selected for the next stage. As the complexity of the distance increases, the number of prototypes decreases, making computation feasible. At each stage, a classification is attempted and a confidence score is computed; if the confidence score is high enough, the remaining stages are skipped.
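A sketch of the filtering loop just described, with the distance functions D_i supplied crudest first and the pool sizes K_i tuned beforehand on a validation set; the pool-purity test stands in for a confidence score compared against C_i, and all names are illustrative:

```python
import numpy as np

def hierarchical_classify(x, samples, labels, distances, K):
    """Multi-stage nearest-neighbor filtering (Fig. 8).

    distances : list of functions d(x, s), crudest approximation first,
                the exact tangent distance last
    K         : per-stage pool sizes K_i
    """
    pool = np.arange(len(samples))
    for dist, k in zip(distances, K):
        d = np.array([dist(x, samples[j]) for j in pool])  # step 1
        if len(np.unique(labels[pool])) == 1:              # step 2: confidence
            return labels[pool[0]]
        k = min(k, len(pool))
        pool = pool[np.argpartition(d, k - 1)[:k]]         # step 3, O(p)
    # Decide among the survivors with the most accurate distance.
    d = np.array([distances[-1](x, samples[j]) for j in pool])
    return labels[pool[np.argmin(d)]]
```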

Table II. Summary of the computation for the classification of one pattern. The first column is the distance index i; the second column indicates the number of tangent vectors (0 for the Euclidean distance); the third column indicates the resolution in pixels; the fourth is K_i, the number of prototypes on which the distance D_i must be computed; the fifth column indicates the number of additional dot products that must be computed to evaluate distance D_i; the sixth column indicates the probability of not skipping that stage after the confidence score has been used; and the last column indicates the total average number of multiply-adds that must be performed (product of columns 3 to 6) at each stage.

Each of the subsequent distances added one tangent vector on each side (Y-translation, scaling, rotation, hyperbolic deformation 1, hyperbolic deformation 2, and thickness) until the full tangent distance was computed (D_11). Table II shows the expected number of multiply-adds at each stage. It should be noted that the full tangent distance need only be computed for 1 in 20 unknown patterns (probability 0.05), and only against 5 samples out of the original 10,000. The net speedup was on the order of 500 compared with computing the full tangent distance between every unknown pattern and every sample (six times faster than computing the Euclidean distance at full resolution).

Multiple Iterations. Tangent distance can be viewed as one iteration of a Newton-type algorithm that finds the points of minimum distance on the true transformation manifolds. The vectors α_E and α_P are the coordinates of the two closest points in the respective tangent spaces, but they can also be interpreted as parameter values for the real (nonlinear) transformations. In other words, α_E and α_P can be used to compute the points s(E, α_E) and s(P, α_P), the real nonlinear transformations of E and P. From these new points, we can recompute the tangent vectors and the tangent distance, and reiterate the process. If the appropriate conditions are met, this process converges to a local minimum of the distance between the two transformation manifolds of P and E. It did not improve handwritten character recognition, but it yielded impressive results in face recognition (Vasconcelos and Lippman, 1998); in that case, each successive iteration was done at increasing resolution (hence combining hierarchical distances and multiple iterations), making the whole process computationally efficient.

III. TANGENT PROPAGATION

The previous section dealt with memory-based techniques. We now apply tangent-distance principles to learned-function techniques. The key idea is to incorporate the invariance directly into the classification function through the optimization of its parameters. More precisely, assume the classification function can be written as G_w(x), where x is the input to the classifier and w is a parameter vector that must be optimized to yield good classification. We present an algorithm, called tangent propagation, in which gradient descent on w is used to improve both classification and transformation invariance on the training data. In a neural network context, the process can be viewed as a generalization of the widely used back propagation method: in addition to propagating information about the training labels, the new algorithm also propagates transformation invariance information.
We again assume that all data are drawn independently from a given statistical distribution and that our learning machine is characterized by the set of functions it can implement, G_w(x), indexed by the vector of parameters w. Ideally, we would like to find the w that minimizes the energy function

E = \int \| G_w(x) - F(x) \|^2 \, dx \quad (22)

where F(x) represents the correct or desired labeling of the point x. In the real world, we must estimate this integral using only a finite set B of training points drawn from the distribution. That is, we try to minimize

E_p = \sum_{i=1}^{p} \| G_w(x_i) - F(x_i) \|^2 \quad (23)

where the sum runs over the training set B. An estimate of w can be computed by following a gradient descent using the weight-update rule:

\Delta w = -\eta \, \frac{\partial E_p}{\partial w} \quad (24)

Consider an input transformation s(x, α) controlled by a parameter α. As always, we require that s be differentiable and that s(x, 0) = x. Now, in addition to the known labels of the training data, we assume that \partial F(s(x_i, \alpha))/\partial \alpha is known at α = 0 for each point x_i in the training set. To incorporate the invariance property into G_w(x), we add the constraint that the quantity

E_r = \sum_{i=1}^{p} \left\| \frac{\partial}{\partial \alpha} \big( G_w(s(x_i, \alpha)) - F(s(x_i, \alpha)) \big) \Big|_{\alpha=0} \right\|^2 \quad (25)

should be small. In many pattern classification problems, we are interested in the local classification invariance property for F(x) with respect to the transformation s (the classification does not change when the input is slightly transformed), so we can simplify Eq. (25) to

E_r = \sum_{i=1}^{p} \left\| \frac{\partial G_w(s(x_i, \alpha))}{\partial \alpha} \Big|_{\alpha=0} \right\|^2 \quad (26)

because \partial F(s(x_i, \alpha))/\partial \alpha = 0 at α = 0. To minimize this term, we can modify the gradient descent rule to use the energy function

E = \mu E_p + \lambda E_r \quad (27)

with the weight-update rule:

\Delta w = -\eta \, \frac{\partial E}{\partial w} \quad (28)

The learning rates (or regularization parameters) μ and λ are tremendously important, because they determine the tradeoff between learning the invariances (based on the chosen directional derivatives) and learning the label itself (i.e., the zeroth derivative) at each point in the training set. The local variation of the classification function, which appears in Eq. (26), can be written as

\frac{\partial G_w(s(x, \alpha))}{\partial \alpha}\Big|_{\alpha=0} = \nabla_x G_w(x) \cdot \frac{\partial s(x, \alpha)}{\partial \alpha}\Big|_{\alpha=0} \quad (29)

because s(x, α) = x if α = 0, where \nabla_x G_w(x) is the Jacobian of G_w(x) for pattern x and \partial s(x, \alpha)/\partial \alpha is the tangent vector associated with transformation s, as described in the previous section. Multiplying the tangent vector by the Jacobian involves one forward propagation through a linearized version of the network. If α is multidimensional, the forward propagation must be repeated for each tangent vector. The theory of Lie algebras (Gilmore, 1974) ensures that compositions of local (small) transformations correspond to linear combinations of the corresponding tangent vectors (this result is discussed further in Section IV). Consequently, if E_r(x) = 0 is verified, the network derivative in the direction of a linear combination of the tangent vectors is equal to the same linear combination of the desired derivatives. In other words, if the network is successfully trained to be locally invariant with respect to horizontal and vertical translations, it will be invariant with respect to compositions thereof. It is possible to devise an efficient algorithm, tangent prop, for performing the weight update (Eq. 28). It is analogous to ordinary back propagation: in addition to propagating neuron activations, it also propagates the tangent vectors. The equations can easily be derived from Figure 9.

Figure 9. Forward (a, x, ρ, ξ) and backward (b, y, ψ, φ) propagated variables in the regular (roman symbols) and the Jacobian (linearized) network (Greek symbols). Converging forks (in the direction in which the signal is traveling) are sums; diverging forks duplicate the values.

A. Local Rule. The forward propagation equation is:

a_i^l = \sum_j w_{ij}^l x_j^{l-1}, \qquad x_i^l = \sigma(a_i^l) \quad (30)

where σ is a nonlinear differentiable function (typically a sigmoid). The forward propagation starts at the first layer (l = 1), with x^0 being the input layer, and ends at the output layer (l = L). Similarly, the tangent forward propagation (tangent prop) is defined by:

\rho_i^l = \sum_j w_{ij}^l \xi_j^{l-1}, \qquad \xi_i^l = \sigma'(a_i^l) \, \rho_i^l \quad (31)

The tangent forward propagation starts at the first layer (l = 1), with ξ^0 being the tangent vector \partial s(x, \alpha)/\partial \alpha |_{\alpha=0}, and ends at the output layer (l = L). The tangent gradient back propagation can be computed using the chain rule:

\psi_i^l = \frac{\partial E_r}{\partial \xi_i^l} = \sum_k \varphi_k^{l+1} w_{ki}^{l+1} \quad (32)

\varphi_i^l = \frac{\partial E_r}{\partial \rho_i^l} = \sigma'(a_i^l) \, \psi_i^l \quad (33)

The tangent backward propagation starts at the output layer (l = L), with ξ^L being the network variation \partial G_w(s(x, \alpha))/\partial \alpha |_{\alpha=0}, and ends at the input layer. Similarly, the gradient back propagation equations are:

y_i^l = \frac{\partial E}{\partial x_i^l} = \sum_k b_k^{l+1} w_{ki}^{l+1} \quad (34)

b_i^l = \frac{\partial E}{\partial a_i^l} = \sigma'(a_i^l) \, y_i^l + \sigma''(a_i^l) \, \rho_i^l \, \psi_i^l \quad (35)
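A minimal NumPy sketch of Eqs. (30) and (31) for a two-layer tanh network; the architecture and names are illustrative, not the paper's. The returned ξ^L is the directional derivative penalized by E_r in Eq. (26), and its gradients, Eqs. (32)-(37), can be obtained with ordinary backpropagation or automatic differentiation.

```python
import numpy as np

def tangent_forward(x, t, W1, W2):
    """Forward prop (Eq. 30) and tangent forward prop (Eq. 31) for
    G_w(x) = W2 tanh(W1 x); t is the tangent vector ds(x, alpha)/dalpha."""
    a1 = W1 @ x                    # a^1 = W^1 x^0
    x1 = np.tanh(a1)               # x^1 = sigma(a^1)
    rho1 = W1 @ t                  # rho^1 = W^1 xi^0
    xi1 = (1.0 - x1 ** 2) * rho1   # xi^1 = sigma'(a^1) rho^1
    out = W2 @ x1                  # network output x^L
    dout = W2 @ xi1                # xi^L: derivative of the output along t
    return out, dout
```

The invariance penalty for one pattern is then simply np.sum(dout ** 2), added to the classification loss with weight λ as in Eq. (27).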

Figure 10. Generalization performance as a function of the training set size for the tangent prop and back prop algorithms.

The standard backward propagation starts at the output layer (l = L), with x^L = G_w(x^0) being the network output, and ends at the input layer. Finally, the weight update is:

\frac{\partial E}{\partial w_{ij}^l} = \frac{\partial E}{\partial a_i^l} \frac{\partial a_i^l}{\partial w_{ij}^l} + \frac{\partial E}{\partial \rho_i^l} \frac{\partial \rho_i^l}{\partial w_{ij}^l} = b_i^l x_j^{l-1} + \varphi_i^l \xi_j^{l-1} \quad (36)

\Delta w_{ij}^l = -\eta \, (b_i^l x_j^{l-1} + \varphi_i^l \xi_j^{l-1}) \quad (37)

The computation requires one forward propagation and one backward propagation per pattern and per tangent vector during training. After the network is trained, it is approximately locally invariant with respect to the chosen transformations, and the evaluation of the learned function is in all ways identical to that of a network not trained for invariance (except that the weights have different values).

B. Results. Two experiments illustrate the advantages of tangent prop. The first experiment is a classification task, using a small (linearly separable) set of 480 binarized handwritten digits. The training sets consist of 10, 20, 40, 80, 160, or 320 patterns, and the test set contains 160 patterns. The patterns are smoothed using a Gaussian kernel with a standard deviation of one-half pixel. For each of the training set patterns, the tangent vectors for horizontal and vertical translation are computed. The network has two hidden layers with locally connected shared weights and one output layer with 10 units (5,194 connections, 1,060 free parameters; LeCun, 1989). The generalization performance as a function of the training set size, for traditional back prop and for tangent prop, is compared in Figure 10. We conducted additional experiments in which we implemented translations, rotations, expansions, and hyperbolic deformations. This set of six generators is a basis for all linear transformations of coordinates for 2-D images. It is straightforward to implement other generators, including gray-level shifting, smooth segmentation, local continuous coordinate transformations, and independent image segment transformations.

The next experiment is designed to show that in applications where the data are highly correlated, tangent prop yields a large speed advantage. Because the distortion model implies adding lots of highly correlated data, the advantage of tangent prop over the distortion model becomes clear. The task is to approximate a function that has plateaus at three locations. We want to enforce local invariance near each of the training points (Fig. 11, bottom). The network has 1 input unit, 20 hidden units, and 1 output unit. Two strategies are possible: either generate a small set of training points covering each of the plateaus (open squares in Fig. 11, bottom) or generate one training point for each plateau (closed squares) and enforce local invariance around them (by setting the desired derivative to 0). The training set of the former method is used as a measure of performance for both methods. All parameters were adjusted for approximately optimal performance in all cases. The learning curves for both models are shown in Figure 11 (top). Each sweep through the training set for tangent prop is a little faster, because it requires only six forward propagations, whereas the distortion model requires nine. As can be seen, stable performance is achieved after 1,300 sweeps for tangent prop, vs. 8,000 for the distortion model; the overall speedup is therefore about 10. In this example, tangent prop can take advantage of a large regularization term. The distortion model is at a disadvantage because the only parameter that effectively controls the amount of regularization is the magnitude of the distortions, and this cannot be increased to large values because the right answer is only invariant under small distortions.

Figure 11.
Comparison of the distortion model (left) and tangent prop (right). The top row gives the learning curves (error vs. number of sweeps through the training set). The bottom row gives the final input-output function of the network; the dashed line is the result for unadorned back prop.

C. How to Make Tangent Prop Work. Large Network Capacity. Relatively few experiments have been done with tangent propagation. It is clear, however, that the invariance constraint can be extremely beneficial; if the network does not have enough capacity, though, it will not benefit from the extra knowledge introduced by the invariance.

Interleaving of the Tangent Vectors. Because the tangent vectors introduce even more correlation inside the training set, a substantial speedup can be obtained by alternating a regular forward and backward propagation with a tangent forward and backward propagation (even if there are several tangent vectors, only one is used at each pattern). For instance, if there were three tangent vectors, the training sequence could be:

x_1, t_1(x_1), x_2, t_2(x_2), x_3, t_3(x_3), x_4, t_1(x_4), x_5, t_2(x_5), \ldots \quad (38)

where x_i means a forward and backward propagation for pattern i, and t_j(x_i) means a tangent forward and backward propagation of tangent vector j of pattern i. With such interleaving, the learning converges faster than when all the tangent vectors are grouped together. Of course, this only makes sense with on-line updates, as opposed to batch updates.

IV. TANGENT VECTORS

We now consider the general paradigm for transformation invariance and for the tangent vectors used in the two previous sections. Before we introduce each transformation and its corresponding tangent vectors, the theory behind the practice is explained. There are two aspects to the problem. First, it is possible to establish a formal connection between groups of transformations of the input space (such as translations and rotations of ℝ²) and their effect on a functional of that space (such as a mapping from ℝ² to ℝ, which may represent an image in continuous form); the theory of Lie groups and Lie algebras (Choquet-Bruhat et al., 1982) allows us to do this. The second aspect involves coding: computer images are finite vectors of discrete variables, so how can a theory developed for differentiable functionals from ℝ² to ℝ be applied to these vectors? We provide a brief explanation of the theorems of Lie groups and Lie algebras that are applicable to pattern recognition, and we explore solutions to the coding problem. Finally, some examples of transformations and coding are given for particular applications.

A. Lie Groups and Lie Algebras. Consider an input space, for example the plane ℝ², and a differentiable function f that maps points of the input space to ℝ:

f : X \mapsto f(X) \in \mathbb{R} \quad (39)

The function f(X) = f(x, y) can be interpreted as the continuous (defined for all points of ℝ²) equivalent of the discrete computer image P[i, j]. Next, consider a family of transformations t_α, parameterized by α, each of which maps points of the input space bijectively to points of the input space:

t_\alpha : X \mapsto t_\alpha(X) \quad (40)

We assume that t_α is differentiable with respect to α and X, and that t_0 is the identity. For example, t_α could be the group of affine transformations of ℝ²:

t_\alpha : \begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} x + \alpha_1 x + \alpha_2 y + \alpha_5 \\ \alpha_3 x + y + \alpha_4 y + \alpha_6 \end{pmatrix}, \quad \text{with } \alpha \text{ such that the map is invertible} \quad (41)

This is a Lie group with six parameters. (A Lie group is a group that is also a differentiable manifold, such that the differentiable structure is compatible with the group structure.) Another example is the group of direct isometries:

t_\alpha : \begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} x \cos\theta - y \sin\theta + a \\ x \sin\theta + y \cos\theta + b \end{pmatrix} \quad (42)

which is a Lie group with three parameters, α = (θ, a, b). We now consider the functional s(f, α), defined by

s(f, \alpha) = f \circ t_\alpha^{-1} \quad (43)

This functional s, which takes another functional f as an argument, should remind the reader of Figure 2, where P, the discrete equivalent of f, is the argument of s.
IV. TANGENT VECTORS

We consider the general paradigm for transformation invariance and for the tangent vectors used in the two previous sections. Before we introduce each transformation and its corresponding tangent vectors, the theory behind the practice is explained. There are two aspects to the problem. First, it is possible to establish a formal connection between groups of transformations of the input space (such as translation and rotation of R^2) and their effect on a functional of that space (such as a mapping of R^2 to R, which may represent an image, in continuous form). The theory of Lie groups and Lie algebras (Choquet-Bruhat et al., 1982) allows us to do this. The second problem involves coding. Computer images are finite vectors of discrete variables. How can a theory that was developed for differentiable functionals of R^2 to R be applied to these vectors? We provide a brief explanation of the theorems of Lie groups and Lie algebras that are applicable to pattern recognition. We also explore solutions to the coding problem. Finally, some examples of transformation and coding are given for particular applications.

A. Lie Groups and Lie Algebras. Consider an input space (e.g., the plane R^2) and a differentiable function f that maps points of the input space to R:

f : X \mapsto f(X)    (39)

The function f(X) = f(x, y) can be interpreted as the continuous (defined for all points of R^2) equivalent of the discrete computer image P[i, j]. Next, consider a family of transformations t_\alpha, parameterized by \alpha, which maps bijectively a point of the input space to another point of the input space:

t_\alpha : X \mapsto t_\alpha(X)    (40)

We assume that t_\alpha is differentiable with respect to \alpha and X, and that t_0 is the identity. For example, t_\alpha could be the group of affine transformations of R^2:

t_\alpha : (x, y) \mapsto (x + \alpha_1 x + \alpha_2 y + \alpha_5,\; \alpha_3 x + y + \alpha_4 y + \alpha_6), \quad \alpha = (\alpha_1, \dots, \alpha_6)    (41)

This is a Lie group with six parameters (a Lie group is a group that is also a differentiable manifold, such that the differentiable structure is compatible with the group structure). Another example is the group of direct isometries:

t_\alpha : (x, y) \mapsto (x \cos\theta - y \sin\theta + a,\; x \sin\theta + y \cos\theta + b)    (42)

with \alpha = (\theta, a, b), which is a Lie group with three parameters. We now consider the functional s(f, \alpha), defined by

s(f, \alpha) = f \circ t_\alpha^{-1}    (43)

This functional s, which takes another functional f as an argument, should remind the reader of Figure 2, where P, the discrete equivalent of f, is the argument of s.

The Lie algebra associated with the action of t_\alpha on f is the space generated by the m local transformations L_i of f, defined by:

L_i(f) = \partial s(f, \alpha) / \partial \alpha_i \,\big|_{\alpha = 0}    (44)

We can now write the local approximation of s as:

s(f, \alpha) = f + \alpha_1 L_1(f) + \alpha_2 L_2(f) + \dots + \alpha_m L_m(f) + o(\|\alpha\|^2)    (45)

This equation is the continuous equivalent of Eq. (2) used in the introduction. The following example illustrates how the L_i can be computed from t_\alpha. Consider the group of direct isometries defined in Eq. (42), with parameter \alpha = (\theta, a, b) as before and X = (x, y):

s(f, \alpha)(X) = f\big((x - a)\cos\theta + (y - b)\sin\theta,\; -(x - a)\sin\theta + (y - b)\cos\theta\big)    (46)

If we differentiate around \alpha = (0, 0, 0) with respect to \theta, we obtain:

\partial s(f, \alpha)(X) / \partial\theta \,\big|_{\alpha = 0} = y\, \partial f/\partial x\,(x, y) - x\, \partial f/\partial y\,(x, y)    (47)

i.e.,

L_\theta = y\, \partial/\partial x - x\, \partial/\partial y    (48)
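As a sanity check on Eqs. (44)-(48) (our illustration, not part of the paper), the sketch below compares a finite-difference derivative of s(f, \theta) for a pure rotation (a = b = 0) with the operator L_\theta applied to a smooth test function; the test function and evaluation point are arbitrary.

    import math

    # A smooth test image f(x, y) and its exact partial derivatives.
    def f(x, y):
        return math.exp(-(x - 0.3)**2 - 0.5 * (y + 0.2)**2)

    def fx(x, y):
        return -2.0 * (x - 0.3) * f(x, y)

    def fy(x, y):
        return -(y + 0.2) * f(x, y)

    def s(theta, x, y):
        # s(f, theta) = f o t_theta^{-1} for a pure rotation (a = b = 0):
        # rotate the coordinates by -theta before evaluating f, as in Eq. (46).
        c, sn = math.cos(theta), math.sin(theta)
        return f(x * c + y * sn, -x * sn + y * c)

    x, y, eps = 0.7, -0.4, 1e-6
    finite_diff = (s(eps, x, y) - s(-eps, x, y)) / (2 * eps)
    lie_operator = y * fx(x, y) - x * fy(x, y)   # Eq. (48) applied to f
    print(finite_diff, lie_operator)             # agree up to finite-difference error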

The operators L_a = -\partial/\partial x and L_b = -\partial/\partial y can be obtained in a similar fashion. All local transformations of the group can be written as:

s(f, \alpha) = f + \theta \big( y\, \partial f/\partial x - x\, \partial f/\partial y \big) - a\, \partial f/\partial x - b\, \partial f/\partial y + o(\|\alpha\|^2)    (49)

which corresponds to a linear combination of the three basic operators L_\theta, L_a, and L_b. (These operators are said to generate a Lie algebra: on top of the addition and the multiplication by a scalar, there is a special multiplication called the Lie bracket, defined by [L_1, L_2] = L_1 \circ L_2 - L_2 \circ L_1. In the above example, [L_\theta, L_a] = L_b, [L_a, L_b] = 0, and [L_b, L_\theta] = L_a.) The most important property is that the three operators generate the whole space of local transformations. The result of applying the operators to a function f, such as a 2-D image, is a set of vectors (referred to as tangent vectors in the previous sections). Each point in the tangent space corresponds to a unique transformation. Conversely, any transformation of the Lie group (in the example, all rotations of any angle and center, together with all translations) corresponds to a point in the tangent plane.

B. Tangent Vectors. The last problem to be solved is that of coding. Computer images, for instance, are coded as a finite set of discrete (even binary) values. These are hardly the differentiable mappings of R^2 to R that we assumed in Section IV.A. To solve this problem, we introduce a smooth interpolating function C, which maps the discrete vectors to continuous mappings of R^2 to R. For example, if P is an image of n pixels, it can be mapped to a continuously valued function f over R^2 by convolving it with a 2-D Gaussian function g_\sigma of standard deviation \sigma. This is because g_\sigma is a differentiable mapping of R^2 to R, and P can be interpreted as a sum of impulse functions. In the 2-D case, the new interpretation of P can be written as:

P(x, y) = \sum_{i,j} P[i][j]\, \delta(x - i)\, \delta(y - j)    (50)

where P[i][j] denotes the finite vector of discrete values, as stored in a computer. The result of the convolution is of course differentiable, because it is a sum of Gaussian functions. The Gaussian mapping is given by:

C_\sigma : P \mapsto f = P * g_\sigma    (51)

In the 2-D case, the function f can be written as:

f(x, y) = \sum_{i,j} P[i][j]\, g_\sigma(x - i, y - j)    (52)

Other coding functions can be used, such as cubic spline or even bilinear interpolation. Bilinear interpolation between the pixels yields a function f that is differentiable almost everywhere. The fact that the derivatives have two values at the integer locations (because the bilinear interpolation is different on both sides of each pixel) is not a problem in practice; one can just choose one of the two values.

Figure 12. Graphic illustration of the computation of f and two tangent vectors corresponding to L_X = \partial/\partial x (X-translation) and L_Y = \partial/\partial y (Y-translation), from a binary image I. The Gaussian function g_\sigma(x, y) = \exp(-(x^2 + y^2)/(2\sigma^2)) has a standard deviation of \sigma = 0.9 in this example, although its graphic representations (small images on the right) have been rescaled for clarity.

The Gaussian mapping is preferred for two reasons. First, the smoothing parameter \sigma can be used to control the locality of the invariance. This is because when f is smoother, the local approximation of Eq. (45) is valid for larger transformations. Second, when combined with the transformation operator L, the derivative can be applied to the closed form of the Gaussian function. For instance, if the X-translation operator L_X = \partial/\partial x is applied to f = P * g_\sigma, the actual computation becomes:

L_X(f) = \partial (P * g_\sigma) / \partial x = P * \big( \partial g_\sigma / \partial x \big)    (53)

because of the differentiation properties of the convolution when the support is compact. This is easily done by convolving the original image with the X-derivative of the Gaussian function g_\sigma (Fig. 12).
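In practice, Eq. (53) amounts to a pair of Gaussian-derivative convolutions. A minimal sketch using NumPy and SciPy follows (our example; the paper does not prescribe an implementation). scipy.ndimage.gaussian_filter with order=1 along an axis convolves the input with the corresponding derivative of the Gaussian.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    sigma = 0.9                        # smoothing used in Figure 12
    P = np.zeros((16, 16))
    P[5:11, 7:9] = 1.0                 # a small binary test image (assumption)

    # f = P * g_sigma, Eq. (51): plain Gaussian smoothing (order=0).
    f = gaussian_filter(P, sigma, order=0)

    # Eq. (53): convolve P with the derivatives of the Gaussian.
    # With arrays indexed [row, col] = [y, x], axis 1 is the x axis.
    tangent_x = gaussian_filter(P, sigma, order=(0, 1))   # P * dg/dx
    tangent_y = gaussian_filter(P, sigma, order=(1, 0))   # P * dg/dy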
Similarly, the tangent vector for scaling can be computed with:

L_S(f) = x\, \partial f/\partial x + y\, \partial f/\partial y = x \big( P * \partial g_\sigma/\partial x \big) + y \big( P * \partial g_\sigma/\partial y \big)    (54)

This operation is illustrated in Figure 13.

Figure 13. Graphic illustration of the computation of the tangent vector T_u = D_x S_x + D_y S_y (bottom image). The displacement for each pixel is proportional to the distance of the pixel to the center of the image (D_x(x, y) = x - x_0 and D_y(x, y) = y - y_0). The two multiplications (horizontal lines) and the addition (vertical right column) are done pixel by pixel.
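Following Figure 13, the scaling tangent vector can be assembled pixel by pixel from the two translation tangent images and the displacement fields. The sketch below is our illustration of Eq. (54), with the image center chosen as (x_0, y_0); names and the test image are assumptions.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def scaling_tangent(I, sigma=0.9):
        """T_u = D_x * S_x + D_y * S_y, as in Figure 13 / Eq. (54)."""
        S_x = gaussian_filter(I, sigma, order=(0, 1))   # I * dg/dx
        S_y = gaussian_filter(I, sigma, order=(1, 0))   # I * dg/dy
        h, w = I.shape
        Y, X = np.mgrid[0:h, 0:w]
        D_x = X - (w - 1) / 2.0        # displacement grows with the distance
        D_y = Y - (h - 1) / 2.0        # from the image center (x_0, y_0)
        return D_x * S_x + D_y * S_y   # pixel-by-pixel multiply and add

    I = np.zeros((16, 16))
    I[4:12, 6:10] = 1.0
    T_scaling = scaling_tangent(I)     # one precomputed tangent-vector image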

C. Important Transformations in Image Processing. This section summarizes how to compute the tangent vectors for image processing (in 2-D). Each discrete image I_i is convolved with a Gaussian of standard deviation \sigma to obtain a representation of the continuous image f_i, according to Eq. (55):

f_i = I_i * g_\sigma    (55)

The resulting image f_i will be used in all the computations requiring I_i (except for computing the tangent vectors). For each image I_i, the tangent vectors are computed by applying the operators corresponding to the transformations of interest to the expression I_i * g_\sigma. The result, which can be precomputed, is an image that is the tangent vector. The following list contains some of the most useful tangent vectors (a computational sketch for two of these operators follows the list):

1. X-translation: This transformation is useful when the classification function is invariant with respect to the input transformation:

t_\alpha : (x, y) \mapsto (x + \alpha,\; y)    (56)

The Lie operator is defined by:

L_X = \partial/\partial x    (57)

2. Y-translation: This transformation is useful when the classification function is invariant with respect to the input transformation:

t_\alpha : (x, y) \mapsto (x,\; y + \alpha)    (58)

The Lie operator is defined by:

L_Y = \partial/\partial y    (59)

3. Rotation: This transformation is useful when the classification function is invariant with respect to the input transformation:

t_\alpha : (x, y) \mapsto (x \cos\alpha - y \sin\alpha,\; x \sin\alpha + y \cos\alpha)    (60)

The Lie operator is defined by:

L_R = y\, \partial/\partial x - x\, \partial/\partial y    (61)

4. Scaling: This transformation is useful when the classification function is invariant with respect to the input transformation:

t_\alpha : (x, y) \mapsto (x + \alpha x,\; y + \alpha y)    (62)

The Lie operator is defined by:

L_S = x\, \partial/\partial x + y\, \partial/\partial y    (63)

5. Parallel hyperbolic transformation: This transformation is useful when the classification function is invariant with respect to the input transformation:

t_\alpha : (x, y) \mapsto (x + \alpha x,\; y - \alpha y)    (64)

The Lie operator is defined by:

L = x\, \partial/\partial x - y\, \partial/\partial y    (65)

6. Diagonal hyperbolic transformation: This transformation is useful when the classification function is invariant with respect to the input transformation:

t_\alpha : (x, y) \mapsto (x + \alpha y,\; y + \alpha x)    (66)

The Lie operator is defined by:

L = y\, \partial/\partial x + x\, \partial/\partial y    (67)

7. Thickening: This transformation is useful when the classification function is invariant with respect to variations of thickness. This is known in morphology as dilation and its inverse, erosion. It is useful in certain domains (such as handwritten character recognition) because thickening and thinning are natural variations that correspond to the pressure applied on a pen or to different absorption properties of the ink on the paper. A dilation (resp. erosion) can be defined as the operation of replacing each value f(x, y) by the largest (resp. smallest) value of f found within a neighborhood of a certain shape, centered at (x, y). The region is called the structural element. We assume here that the structural element is a sphere of radius \alpha. We define the thickening transformation as the function that takes the function f and generates the function f^\alpha defined by:

f^\alpha(X) = \max_{\|r\| \le \alpha} f(X + r) \quad \text{for } \alpha \ge 0    (68)

f^\alpha(X) = \min_{\|r\| \le |\alpha|} f(X + r) \quad \text{for } \alpha < 0    (69)

The derivative of the thickening for \alpha \ge 0 can be written as:

\lim_{\alpha \to 0} \big( f^\alpha(X) - f(X) \big) / \alpha = \lim_{\alpha \to 0} \max_{\|r\| \le \alpha} \big( f(X + r) - f(X) \big) / \alpha    (70)

f(X) can be put within the max expression because it does not depend on r. Because \alpha tends toward 0, we can write:

f(X + r) - f(X) = r \cdot \nabla f(X) + O(\|r\|^2)    (71)

The maximum of

\max_{\|r\| \le \alpha} \big( f(X + r) - f(X) \big) \approx \max_{\|r\| \le \alpha} \; r \cdot \nabla f(X)    (72)

is attained when r and \nabla f(X) are collinear, that is, when

r = \alpha\, \nabla f(X) / \|\nabla f(X)\|    (73)

assuming \alpha \ge 0. It can easily be shown that this equation also holds when \alpha is negative, because we then try to minimize Eq. (69). Therefore:

\lim_{\alpha \to 0} \big( f^\alpha(X) - f(X) \big) / \alpha = \|\nabla f(X)\|    (74)

which is the tangent vector of interest: the norm of the gradient of the image, which is easily computed. Note that this is true for positive or negative \alpha; the same tangent vector describes both thickening and thinning. Alternatively, our computation of the displacement r can be used, and the following transformation of the input can be defined:

t_\alpha(f) : (x, y) \mapsto (x + \alpha r_x,\; y + \alpha r_y)    (75)

where

(r_x, r_y) = \nabla f(X) / \|\nabla f(X)\|    (76)

This transformation of the input space is different for each pattern f (we do not have a Lie group of transformations), but the field structure generated by the (pseudo-Lie) operator is still useful. The operator used to find the tangent vector is defined by:

L_T(f) = \|\nabla f\|    (77)

which means that the tangent vector image is obtained by computing the norm of the gray-level gradient of the image at each point (the displacement field of Eq. (76) is the gradient normalized at each point).

The last five transformations are depicted in Figure 14 with their tangent vectors. The last operator corresponds to a thickening or thinning of the image. This unusual transformation is extremely useful for handwritten character recognition.

Figure 14. Illustration of five tangent vectors (top), corresponding displacements (middle), and transformation effects (bottom). The displacements D_x and D_y are represented in the form of a vector field. The tangent vector for the thickness deformation (right column) corresponds to the norm of the gradient of the gray-level image.
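As a computational illustration of the list above (our sketch, with the same assumptions as the earlier examples), the rotation tangent vector of Eq. (61) combines the two translation tangent images with coordinates measured from the image center, and the thickening tangent vector of Eqs. (74) and (77) is the pixelwise norm of the smoothed gradient.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def rotation_tangent(I, sigma=0.9):
        """L_R(f) = y df/dx - x df/dy (Eq. 61), with (x, y) measured
        from the image center (a common choice; the center is assumed)."""
        S_x = gaussian_filter(I, sigma, order=(0, 1))   # df/dx
        S_y = gaussian_filter(I, sigma, order=(1, 0))   # df/dy
        h, w = I.shape
        Y, X = np.mgrid[0:h, 0:w]
        Xc = X - (w - 1) / 2.0
        Yc = Y - (h - 1) / 2.0
        return Yc * S_x - Xc * S_y

    def thickening_tangent(I, sigma=0.9):
        """L_T(f) = ||grad f|| (Eqs. 74 and 77): the norm of the gradient
        of the smoothed image, computed pixel by pixel."""
        S_x = gaussian_filter(I, sigma, order=(0, 1))
        S_y = gaussian_filter(I, sigma, order=(1, 0))
        return np.sqrt(S_x**2 + S_y**2)

    I = np.zeros((16, 16))
    I[4:12, 6:10] = 1.0
    T_rot, T_thick = rotation_tangent(I), thickening_tangent(I)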
