LEARNING BAG-OF-FEATURES REPRESENTATIONS FOR HANDWRITING RECOGNITION


LEARNING BAG-OF-FEATURES REPRESENTATIONS FOR HANDWRITING RECOGNITION

Leonard Rothacker
Diploma thesis
Department of Computer Science
Technische Universität Dortmund
November 2011

Leonard Rothacker: Learning bag-of-features representations for handwriting recognition, Diploma thesis, November 2011, 1st revision (January 6, 2012)

Supervisors: Prof. Dr.-Ing. Gernot A. Fink, Dr. Szilárd Vajda

CONTENTS

1 introduction
2 offline handwriting recognition
   Overview
   Preprocessing and normalization
   Serialization and feature extraction
   Hidden Markov Models
   Conclusion
3 bag-of-features image representations
   Bag-of-features
   Local image features
      Feature detection
      Feature descriptors
   Principal component analysis
   Clustering and quantization
   Applications
      Image categorization
      Image retrieval
   Conclusion
4 character and word recognition
   Part-based character recognition
      Digit recognition
      Degraded character recognition
      Chinese character recognition
   Word spotting
      Segmentation-based word spotting
      Segmentation-free word spotting
   Conclusion
5 bag-of-features handwriting recognition
   Context-based method
      Overview
      Features
      Visual vocabulary and descriptor quantization
      Sliding bag-of-features
      Hidden Markov Models
   Holistic method
      Features
      Categorization
   Conclusion

6 evaluation
   Datasets
      MNIST database
      IAM database
      IFN/ENIT database
   Holistic method
      Evaluation
      Experiments
      Results
   Context-based method
      Reference method
      Evaluation
      Experiments
      Results
7 conclusion
bibliography

LIST OF FIGURES

Figure 1   Variabilities in handwritten script
Figure 2   Overview of a handwriting recognition system
Figure 3   Example for baseline estimation
Figure 4   Serialization with a sliding window
Figure 5   Schematic HMM illustration
Figure 6   Continuous and Semi-Continuous HMMs
Figure 7   Bag-of-features image representations
Figure 8   Application of Harris Corner detector
Figure 9   SIFT interest point detection
Figure 10  Application of the SIFT detector
Figure 11  SIFT interest point descriptor
Figure 12  Example of principal component analysis
Figure 13  Clustering with Lloyd's algorithm
Figure 14  Clustering with mixture density models
Figure 15  Illustration of image categorization
Figure 16  Example for image retrieval
Figure 17  Schematic illustration of image retrieval
Figure 18  Example for spatial consistency
Figure 19  Part-based digit recognition
Figure 20  Part-based degraded character recognition
Figure 21  Character-SIFT feature extraction
Figure 22  Index generation with word spotting
Figure 23  SIFT-based sliding window application
Figure 24  Bag-of-features word spotting
Figure 25  Bag-of-features handwriting recognition
Figure 26  Feature detection in word images
Figure 27  Feature description in word images
Figure 28  Bag-of-features HMM integration
Figure 29  Feature description for holistic recognition
Figure 30  Holistic bag-of-features k-NN categorization
Figure 31  Samples from MNIST
Figure 32  Text-line image from IAM-DB
Figure 33  Word image from IFN/ENIT
Figure 34  Confusion-matrices for MNIST


1 INTRODUCTION

Handwriting recognition is the task of automatically transcribing handwritten text into a machine-readable textual representation. A central requirement is that the process should not depend on a particular writer. In order to improve upon existing approaches, a novel integration with the bag-of-features method will be proposed. The bag-of-features method can be used for learning feature representations statistically. Feature extraction reduces the data to the information relevant for a specific recognition task.

Handwriting recognition is difficult due to the great variability found in human writing. Personal writing characteristics have an important influence, leading to very different visual appearances of the same characters. A recognition system consequently has to learn to distinguish these highly varying character instances from one another. Figure 1 shows two images of the same sentence, each written by a different writer. The visual appearances differ in slant, size and pen-stroke. In this thesis we will only consider handwritten texts given in the form of images.

Due to the mentioned difficulties, automatic handwriting recognition is only well established in application areas with restricted recognition domains. This refers, for example, to automatic postal address reading or bank check reading. Both recognition tasks are constrained because the objective is not to recognize arbitrary texts. Instead, some prior knowledge is taken into account in order to simplify the recognition. When the postcode of an address has been identified, the possible street names can be constrained. This means not all possible sequences of characters have to be considered, but only those matching a dictionary. For recognizing bank checks, the writer needs to use special forms that contain boxes for every single character. With this simplification only isolated characters and digits have to be recognized. This is much easier than recognizing full words because, in comparison to the number of possible word instances, the number of possible characters and digits is very limited. Moreover, a segmentation of words into characters is not easily derivable, as the character boundaries are not marked in the image. Postal address and bank check recognition can consequently be considered solved.

An unsolved problem is the recognition of unconstrained handwritten texts where no prior information is available. Neither the text's topic is known, nor is it known where the text is located and how it will appear. In particular, document images that are typically subject to handwriting recognition may contain other elements like figures or tables.

Figure 1: Variabilities in handwritten script. Images taken from [MB02].

After identifying text regions, text-lines are segmented. Before applying a recognizer, features are extracted that usually describe the pen-stroke more abstractly. Traditional feature extraction methods are hand-crafted and therefore based on intuition and expert knowledge. Given the high variability found in handwritten scripts and the difficulties in their recognition, the question arises whether there is more potential in improving the features than in improving the recognition method. Instead of designing the features manually, the idea is to learn a suitable representation. Although heuristically defined features have proven to work well in many experiments, there always remain cases in which the feature extraction method might not have been the optimal choice. It is also not directly apparent whether features designed for one script are suitable for a different script. This last aspect is of special interest because learned feature representations should adapt automatically, making them applicable to different writing systems.

The feature learning technique to be integrated here is widely used in image retrieval and categorization. The basic approach is to find approximations of frequently occurring small image patches. The so-called bag-of-features representation is then given by a statistic over the occurrences of these representatives. The learning process consists of estimating the representatives from example images. In the application to images of handwritten script, the image patches contain parts of characters.

The objective of this thesis is therefore to integrate learned bag-of-features representations with a handwriting recognizer. It should at least be possible to match the recognition rates of the reference system, which uses hand-crafted features. For testing the independence from a particular script, datasets of Roman and Arabic handwriting will be used in the evaluation.

The thesis is structured as follows: Chapters 2 and 3 cover the fundamentals regarding handwriting recognition and bag-of-features image representations. Related methods for part-based character recognition and word spotting are briefly discussed in Chapter 4. The integration of bag-of-features representations with a handwriting recognizer is presented in Chapter 5 and evaluated in Chapter 6. A final conclusion follows in Chapter 7.

2 OFFLINE HANDWRITING RECOGNITION

Offline handwriting recognition is the task of automatically transcribing images of handwritten script into symbolic text representations. It must therefore be differentiated from online systems where trajectories of pen-strokes are given. These trajectories contain the complete information about the writing process, whereas an image, as given in offline systems, only depicts the final result (cf. [MG01]). In this chapter methods for transcribing images of handwritten text are introduced. They are strongly related to the handwriting recognition system that will be used to apply automatically learned features (see Section 5.1). Furthermore, this system serves as reference in the evaluation (see Section 6.3.1). For more elaborate surveys refer to [PF09] or [Bun03]. In the following, Section 2.1 gives an overview and Sections 2.2, 2.3 and 2.4 concentrate on the relevant details.

2.1 overview

Automatic handwriting recognition is difficult due to the great variability found in human writing. Additionally, the appearance of characters varies and depends on their local context. In order to cope with these challenges, a typical handwriting recognition system applies a series of methods. After normalizing the images with respect to the variabilities mentioned, a representation is extracted that can be used in a handwriting recognizer. The final outcome is a transcription of the text image. In the remainder of this section three general approaches to handwriting recognition will briefly be introduced. Afterwards, the overview concludes with a discussion of the different steps in an overall system.

The three approaches to handwriting recognition that will be presented next are referred to as holistic methods, segmentation-based methods and segmentation-free methods. Holistic methods try to classify a word as a whole. This only works for recognizing small numbers of different words because each word has to be distinguished from all other words (cf. [MG01]). The number of words that exist in a language and the number of words that can be handled by such a recognition system differ greatly. Basic systems discussed in [MG01] work with 100 to 1000 words in their lexicon, unless prior knowledge is included for restricting the lexicon dynamically. In contrast, according to the Oxford Dictionary, the English language contains 171,476 actively used words [web11]. For applications requiring the

transcription of unconstrained texts, holistic word recognition is consequently insufficient.

Another approach is to use segmentation-based methods. Here the words to be classified are segmented into smaller subunits. These can be recognized and used to transcribe the respective word (cf. e.g. [GKK+97]). The problem with cursive script is to find segments that can be successfully classified by the recognizer. Words and characters are written continuously and in close agglomeration; their boundaries are often not directly detectable.

Segmentation-free methods avoid segmenting handwritten text before recognition. Hidden Markov Models (HMMs) are a prominent representative in this regard. HMMs have been successfully applied in speech recognition and are well established in handwriting recognition (cf. e.g. [HAJ90, Fin08, PF09]). HMMs model the elementary units of the application domain. In handwriting these are usually characters, but recognition on word level is also possible (cf. [PF09]). Their internal structure consists of states describing the stochastic generation of observation sequences. This means that the processed data needs to be represented in terms of such observations; here these are vector valued. Using probabilistic inference, the state sequence that most likely generated the observation sequence representing the actual data can be decoded. Given the association of observations and internal HMM states, the data's segmentation can additionally be derived (cf. e.g. [HAJ90, Fin08]). Hidden Markov Models will be further discussed in Section 2.4.

In all three approaches a statistical model must be estimated for recognition. This estimation is based on training data; a disjoint test dataset is used for evaluation. After this short introduction to different recognition methods, the typical architecture of a segmentation-free handwriting recognition system will be discussed next. Three major parts can be identified:

1. Preprocessing and normalization: Preprocessing mainly refers to the identification and extraction of text-lines from document images. Variabilities in the obtained text-line images that are irrelevant to the recognition process are reduced in the normalization (see Section 2.2).

2. Serialization and feature extraction: An HMM models the stochastic generation of observation sequences. In handwriting recognition these will be given by feature vectors. A serialization method is applied in order to obtain such a sequence from text-line images. Feature vectors are supposed to encode the information that is relevant for recognition (see Section 2.3).

3. Recognition: Given the observation sequence, the state sequence in the HMM that most likely generated the observation sequence is decoded. The process is also referred to as model decoding

(cf. [PF09]) because the state sequence is hidden and has to be uncovered by probabilistic inference (see Section 2.4).

Figure 2: Overview of a typical HMM-based handwriting recognition system. Different steps in the pipeline architecture are visualized by boxes. In between, results are illustrated by examples. Based on image from [PF09].

An overview is shown in Figure 2. The preprocessing and normalization methods presented in Section 2.2 are indicated by the dashed box. They are relevant for offline handwriting recognition in general. The following two boxes refer to the methods discussed in Sections 2.3 and 2.4. In contrast, they are specific to HMM-based systems.

2.2 preprocessing and normalization

In general, handwriting recognition methods apply preprocessing and normalization techniques. Based on images of handwritten text, the representations are reduced to the information that is relevant for recognition. This may refer to removing artifacts caused by image acquisition or to reducing the variability found in different writing styles. The outcome is images of normalized handwritten text-lines. Figure 2 shows an overview of the architecture of an HMM-based offline handwriting recognition system. For the current discussion please refer to the steps aggregated by the dashed box.

When scanning handwritten text or taking pictures, images are usually acquired from complete documents. These often contain objects other than text, for example images, tables or black borders from the scanning process. By performing a document layout analysis, text elements are identified and non-text elements are filtered out. In Figure 2 this step is referred to as text detection and directly follows the image acquisition. A survey discussing document structure analysis can be found in [MRK03].

Images of documents are usually represented by gray value intensities. For the text recognition process only the pen-stroke itself, but not the distribution of different gray values, is important. Gray values in text-line images are therefore thresholded to zero or one, respectively. It is simplest to set the threshold manually, but this way the pen-stroke might get damaged in low-contrast document images.

The Otsu method is very common for determining the threshold automatically. It is estimated based on the histogram of image intensity values [Ots79]. Further improvement can be achieved with locally adaptive methods that are able to handle inhomogeneities like fading pen-strokes or low-contrast areas within a document. These methods calculate a per-pixel threshold depending on the local pixel neighborhood. It is possible to apply Otsu's method locally, but due to speed considerations the Niblack method is more common. It estimates thresholds based on image moments in the local pixel neighborhood. For a more comprehensive discussion of adaptive document binarization techniques and a review of Niblack's method refer to [SP00]. Note that the text-line examples in Figure 2 have been binarized with the global Otsu method; limitations in low-contrast areas are noticeable at the beginning of the second and third line.

Because text is usually oriented horizontally, lines of text have to be extracted in the next step. Script oriented vertically will not be considered here, but can be treated analogously. Different text-lines can be identified with projection profiles. These are common in handwritten script analysis in general and can also be used for baseline and slant estimation (see below). In the application to line extraction, the pixels in each row of the text image are accumulated. High counts in the profile indicate text and low counts a gap between lines. Usually the distinction is based on a heuristically chosen threshold. Due to their row-wise application they will be referred to as horizontal projection profiles; in column-wise applications they will be referred to as vertical projection profiles, respectively. In Figure 2 text-lines have been extracted from the previously localized text region. For further details and different text-line extraction methods refer to [LZT07]. At this point the original document has been reduced to its relevant parts. For handwriting recognition only the text-line images will be considered any further.
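To make the two steps just described more concrete, the following Python sketch computes a global Otsu threshold from the intensity histogram and segments text-lines with a horizontal projection profile. It is an illustrative approximation under simplifying assumptions, not the implementation used in the thesis; the stand-in image, the minimum row count and all names are chosen for the example only.

    # Sketch: global Otsu binarization and projection-profile line segmentation.
    import numpy as np

    def otsu_threshold(gray):
        """Return the threshold maximizing the between-class variance (Otsu)."""
        hist, _ = np.histogram(gray, bins=256, range=(0, 256))
        total = hist.sum()
        sum_all = np.dot(np.arange(256), hist)
        best_t, best_var = 0, -1.0
        w0, sum0 = 0, 0.0
        for t in range(256):
            w0 += hist[t]
            if w0 == 0 or w0 == total:
                continue
            sum0 += t * hist[t]
            w1 = total - w0
            mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
            var_between = w0 * w1 * (mu0 - mu1) ** 2
            if var_between > best_var:
                best_var, best_t = var_between, t
        return best_t

    def segment_lines(binary, min_count=5):
        """Split a binarized page into text-line slices via the row profile."""
        profile = binary.sum(axis=1)            # black pixels per row
        in_line, start, lines = False, 0, []
        for r, count in enumerate(profile):
            if count > min_count and not in_line:
                in_line, start = True, r
            elif count <= min_count and in_line:
                in_line = False
                lines.append(binary[start:r])
        if in_line:
            lines.append(binary[start:])
        return lines

    gray = np.random.randint(0, 256, (200, 600)).astype(np.uint8)  # stand-in page
    binary = (gray < otsu_threshold(gray)).astype(np.uint8)        # 1 = pen-stroke
    print(len(segment_lines(binary)), "line candidates")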

The next step, present in almost every handwriting recognition system, is text-line normalization. Typical variations in writing can be removed so that they will not be included in a later feature representation. This simplifies the recognition. Typically the following variabilities are treated:

Baseline variations: The text is not written on a straight horizontal line but varies vertically. An example is given in Figure 2, where the last extracted text-line has a strong curvature. Normalization is achieved by rotating words until their estimated baselines match the extrema of the character contour. Figure 3 shows an example for baseline estimation. Horizontal projection profiles are used in order to estimate the word's core region. From its upper and lower bounds, baseline estimates can be derived [VL01].

Figure 3: The image shows the analysis of the word's core region for baseline estimation. On the right the horizontal projection profile is shown. The core region is found in the high-density area exceeding the threshold. It is enclosed by the lower and upper baseline. Baseline estimates in the ascender region can be discarded due to their weaker intensities in the projection profile. Image taken from [VL01].

Slant variations: Humans mostly write cursively, forming italic characters. By shearing characters to an upright position these variabilities can be reduced. In [VL01] the character orientation is measured by vertical projection histograms.

Size variations: Size normalization is crucial to a handwriting recognition system [PF09]. Different writing styles may differ largely in size, and this also affects the respective feature representation. Normalization is usually based on the core region's size, also referred to as core size [PF09]. An example for core region estimation can be found in Figure 3.

The presented preprocessing and normalization methods are relevant for offline handwriting recognition in general. In the remainder of this chapter we will focus on HMM-based handwriting recognition.

2.3 serialization and feature extraction

HMMs model the generation of observation sequences stochastically. For an application in handwriting recognition, images need to be serialized into a sequence of feature vectors accordingly. The use of a sliding window for that purpose is widespread (cf. e.g. [MB00, Bun03, PM03, WFS05]). Typically, the window has the same height as the image and covers a small portion of a character in width. The window is then slid over the text-line image in writing direction, e.g. left-to-right for Roman and right-to-left for Arabic script. Subsequent windows overlap. In practice, parameters may vary from system to system and may also depend on previously applied normalizations. For instance, in [WFS05] a four-pixel window with 50% overlap is used. Feature representations are then calculated from the window content at each window position. In text-line images the content is a pen-stroke slice unless the window is slid over entirely white image regions. Figure 4 shows an example.

Figure 4: The image of handwritten script is serialized by a sliding window. At each window position a feature representation of the current content is extracted. The window function is indicated by the rectangular box function clipping the script accordingly. For better visualization the illustrated window is wider than in practical applications. Image of script taken from [MB02].
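A minimal sketch of the serialization step described above, assuming a binarized text-line image given as a NumPy array. The window width and overlap follow the [WFS05] example (four pixels, 50% overlap); everything else, including the stand-in image, is illustrative.

    # Sketch: serialize a text-line image into a sequence of window contents.
    import numpy as np

    def sliding_windows(line_img, width=4, overlap=0.5):
        step = max(1, int(round(width * (1.0 - overlap))))
        windows = []
        for x in range(0, line_img.shape[1] - width + 1, step):
            windows.append(line_img[:, x:x + width])   # full height, narrow slice
        return windows

    line_img = np.zeros((60, 300), dtype=np.uint8)     # stand-in text-line image
    observations = sliding_windows(line_img)
    print(len(observations), "window positions")

From each of these window contents a feature vector is computed, yielding the observation sequence required by the HMM.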

All information necessary for recognition is present in the pixels within the sliding window. But this high-dimensional representation also encodes information that is irrelevant or that makes recognition even more difficult due to further variabilities. Feature extraction is therefore important in order to find a low-dimensional representation containing the information relevant to the recognition. This is usually achieved by abstracting from the raw data. Ideally, features corresponding to similar pen-stroke slices lie compactly in feature space and far away from features of different pen-stroke slices. This means the feature representation should contain the information defining a pen-stroke within a sliding window. Furthermore, a low-dimensional feature representation is desirable for a robust HMM parameter estimation [PF09]. The number of model parameters directly depends on the feature dimensionality, and a robust estimation is only possible with a sufficient number of training samples per model parameter.

In [PF09] analytical features are distinguished from geometrical features. Analytical features are directly based on the pixels within the windows. By applying principal component analysis (see Section 3.3) the data can be decorrelated and its dimensionality can be reduced. For example, in [PM03] vector representations are obtained by concatenating the pixels from a three-column sliding window. Feature vectors are then calculated according to principal component analysis; the transformation is estimated on training data. An approach based on linear discriminant analysis is presented in [DHU97]: After obtaining a segmentation of the training data by using a reference system, feature vectors can be labeled according to their associated character. With linear discriminant analysis a transformation is estimated that aims at fulfilling exactly the initial requirement for feature representations:

Features of one class should be compact in feature space and features of different classes far away from each other.

Geometrical features are chosen heuristically and characterize the pen-stroke in the respective window. Many different approaches have been developed. In all cases their motivation is based on intuition and expert knowledge, and their applicability can only be demonstrated experimentally. Nevertheless, they are widely used and produce state-of-the-art results. In order to give a few examples, the geometrical features from [WFS05] will be presented. The basic idea was first introduced in [MB00]. The following features are calculated for each column of the sliding window:

1. Number of transitions from black to white.
2. Distance between the gravity center of the black pixels and the baseline.
3. Distance between the topmost black pixel and the baseline.
4. Distance between the lowermost black pixel and the baseline.
5. Relative number of black pixels with respect to the number of pixels between the lowermost and the topmost black pixel.
6. Relative number of black pixels with respect to the number of pixels in the column.

Features one to four are normalized with respect to the core size. In order to obtain one feature vector per window, the six features are averaged over the window columns. Additionally, information about the script contour directions is included. Lines are therefore estimated within the sliding window between

7. the topmost contour points,
8. the lowermost contour points,
9. and the gravity centers.

The line orientations are used as features. Finally, the temporal context among adjacent sliding windows is taken into account by computing horizontal derivatives of their 9-dimensional feature vector representations. This way an 18-dimensional feature vector is obtained for each sliding window. The derivatives are very helpful for analyzing the dynamic properties of the script. They are useful for separating characters with homogeneous and inhomogeneous appearances as well as character transitions. In [WFS05] an analytic approach is set on top of the geometric features: Linear discriminant analysis is applied according to [DHU97].
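As an illustration of the column-wise features 1-6 listed above, the following sketch computes them for a single binary window and averages them over the columns. The baseline row and core size are assumed to be supplied by the normalization step; the contour-direction features 7-9 and the derivative features are omitted, and the exact normalization in [WFS05] may differ from this simplified version.

    # Sketch: column-wise geometrical features 1-6 for one sliding window
    # (window entries: 1 = black pen-stroke, 0 = background).
    import numpy as np

    def window_features(window, baseline, core_size):
        feats = []
        for col in window.T:                              # iterate over columns
            black = np.flatnonzero(col)
            if black.size == 0:
                feats.append(np.zeros(6))
                continue
            transitions = np.count_nonzero(np.diff(col.astype(int)) == -1)
            center = black.mean()
            top, bottom = black.min(), black.max()
            feats.append(np.array([
                transitions / core_size,                  # 1: black-to-white transitions
                (baseline - center) / core_size,          # 2: gravity center vs. baseline
                (baseline - top) / core_size,             # 3: topmost black pixel
                (baseline - bottom) / core_size,          # 4: lowermost black pixel
                black.size / (bottom - top + 1),          # 5: density within stroke extent
                black.size / col.size,                    # 6: density within the column
            ]))
        return np.mean(feats, axis=0)                     # one 6-dim vector per window

    window = np.zeros((60, 4), dtype=np.uint8)            # toy window with a stroke
    window[20:35, :] = 1
    print(window_features(window, baseline=40, core_size=15))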

2.4 hidden markov models

Hidden Markov Models are well suited for modeling temporal sequences. There are various fields of application like speech recognition or DNA sequence analysis (cf. [Fin08] Chap. 2), but we will limit the discussion to the aspects relevant for handwriting recognition. In this regard, sequences of characters or, considering even smaller entities, subunits forming characters are modeled. With respect to the example of character sequences, it is quite intuitive that some arrangements are more frequent than others. In the English language, "ea" has a much higher occurrence probability than "ae", for instance. These regularities can be detected and modeled by an HMM. The generation of such sequences can be interpreted as the first stage in a two-stage stochastic process (cf. [Fin08] Chap. 5). In our application to handwriting recognition we are not able to directly observe these hidden character sequences but rather their feature representations (see Section 2.3). For modeling this important aspect, a second stage in the stochastic process is needed. It refers to the generation of an observation at each point in time. As discussed in Section 2.3, feature vectors should be representative for the class or entity they belong to. This aspect has to be captured here: The occurrence probabilities of different observations depend on the entity they ought to represent. HMMs for handwriting recognition model these entities by states, and state sequences by transitions carried out between them with certain probabilities. This way, at each point in time one state is active and generates an observation with respect to its associated continuous probability distribution.

The HMMs are of first order, thus there is an assumption for each stage in the stochastic process. In the first stage the HMM uses a first-order Markov process (cf. e.g. [HAJ90, Fin08] Chap. 5). It fulfills the Markov assumption because the active state only depends on a limited number of predecessors, in this case only on the immediate predecessor (cf. [HAJ90] Chap. 5). Therefore, S_t is a discrete random variable over a finite set of states. It denotes the state being active at time t. The conditional probability in Equation 2.1 only depends on the active state at time t − 1, thus it does not change if more predecessors are considered.

P(S_t | S_1, ..., S_{t−1}) = P(S_t | S_{t−1})    (2.1)

While this assumption limits the memory of the HMM along with its descriptive capabilities, it also limits the number of model parameters that need to be estimated from the training dataset ([HAJ90] Chap. 5). As already mentioned, the number of model parameters and the number of training samples necessary for their robust estimation are correlated, and the bigger the dataset, the harder its acquisition and maintenance.

Figure 5: HMM visualization as a generative finite state machine. Circles indicate states. Arrows between them indicate the possible transitions in this example; note that different transition models are possible. The output generation is outlined by the arrows beneath the states. Based on image from [PF09].

The second assumption regards observation modeling and is referred to as the output-independence assumption (cf. [HAJ90] Chap. 5). Therefore, let x_1, ..., x_t denote the sequence of observations, i.e. feature vectors, until time t (Equation 2.2).

p(x_t | x_1, ..., x_{t−1}, S_1, ..., S_t) = p(x_t | S_t)    (2.2)

The output density at time t only depends on the active state, neither on previous observations nor on previous states.

Figure 5 shows a scheme outlining the basic elements constituting the HMM. The visualization highlights that the HMM is a generative finite state machine. In the example, transitions can be carried out to the active state and to its succeeding state. Output probability density functions are associated with all states. The following summary provides a more formal description of the respective components:

- A set of states S = {s_1, ..., s_N}. In Figure 5 these are illustrated by circles. The states are defined by the task: Recognition on character level, for instance, requires a model per character, in contrast to recognition on word level. The number of states for each model is usually chosen manually. Multiple HMMs can be concatenated into a bigger network.

- Transition probabilities between states (Equation 2.3).

  A = {a_ij | a_ij = P(S_t = s_j | S_{t−1} = s_i)}    (2.3)

  a_ij is the probability of s_j being the active state at time t if at time t − 1 state s_i has been active. Arrows between states in Figure 5 indicate transitions. Note that all transition probabilities of any state have to sum to one. The illustration shows a so-called linear model where transitions go back to the same state and to the next state. In the more complicated Bakis model, transitions may additionally go to the next but one state. This way there is a probability for skipping states. For further information refer to ([Fin08] Chap. 8).

- Output probability density functions for each state (Equation 2.4).

  b_j(x) = p(x | S_t = s_j)    (2.4)

  Vectors x are distributed with respect to a given statistical model, usually mixtures of Gaussian distributions. In continuous mixture HMMs each state has its own model parameters, in contrast to semi-continuous mixture HMMs where some parameters are shared ([HAJ90] Chap. 7). Figure 5 indicates the generation of observations by the big arrows beneath the states.

- Start probabilities for each state (Equation 2.5).

  π = {π_i | π_i = P(S_1 = s_i)}    (2.5)

  π_i is the probability of a sequence starting with state s_i.

The parameter set describing a complete HMM is denoted by λ. For modeling the output densities, Gaussian mixture models are most common. As their name indicates, a given number of Gaussian distributions is combined into one probability density function. Gaussian components are defined by a mean vector µ and a covariance matrix C. If vectors x are drawn according to a multivariate random variable X, µ is equivalent to the expected value of the actual distribution p_X (Equation 2.6). C describes the pairwise correlations between the feature vector components, where (·)^T denotes the transpose (Equation 2.7) (cf. e.g. [Fin08] Chap. 3). For practical applications p_X is unknown, thus µ and C have to be estimated accordingly.

µ = ∫_{x ∈ R^n} x · p_X(x) dx    (2.6)

C = ∫_{x ∈ R^n} (x − µ)(x − µ)^T · p_X(x) dx    (2.7)

The Gaussian distribution is defined in Equation 2.8, where |2πC| denotes the determinant of the scaled covariance matrix.

N(x | µ, C) = (1 / √|2πC|) · e^{−(1/2)(x − µ)^T C^{−1}(x − µ)}    (2.8)

The mixture is then defined by integrating a given number of Gaussian mixture components in a convex linear combination. The approach presented so far refers to continuous or semi-continuous mixture HMMs. In a continuous mixture HMM, for each state and each of its mixture components an n-dimensional mean vector and the symmetric n × n covariance matrix have to be estimated, where n denotes the dimension of the feature vector. Also the number of mixture components M_j is given per state. Mixture weights c_jk ∈ [0, 1] need to sum to one in order to normalize the resulting mixture accordingly (Equation 2.9).

Figure 6: Comparison between continuous and semi-continuous HMMs. Figure 6a illustrates the continuous case: for each state s_1, ..., s_3 an individual mixture of Gaussian distributions g_jk is estimated. In the semi-continuous case (Figure 6b) the Gaussian distributions are shared by all states; only the mixture coefficients are estimated individually. Images taken from [Sch95].

b_j(x) = ∑_{k=1}^{M_j} c_jk · N(x | µ_jk, C_jk) = ∑_{k=1}^{M_j} c_jk · g_jk(x)    (2.9)

In Figure 6a this is illustrated for three states, each using a mixture of three Gaussian distributions. Also compare the coefficients and parameters in Equation 2.9, which depend on a specific state s_j. Robustly estimating all these parameters requires many samples in the training dataset.

Semi-continuous mixture HMMs were first introduced in [HAJ90]. Instead of estimating a complete Gaussian mixture model for each state, only the mixture weights are estimated individually; the Gaussian mixture components are shared by all states. Figure 6 shows a comparison between the methods. Equation 2.9 can thus be simplified: In Equation 2.10 the parameters of the Gaussians only depend on the mixture component k but not on the state s_j. Also the number of mixture components M no longer depends on a specific state.

b_j(x) = ∑_{k=1}^{M} c_jk · N(x | µ_k, C_k) = ∑_{k=1}^{M} c_jk · g_k(x)    (2.10)

This way a substantial reduction of the number of free parameters is possible and the parameter estimation is less problematic. One mixture model can describe the complete training dataset, thus the training data tends to be equally distributed among the mixture components. In the continuous HMM case a sufficient number of samples per state is necessary for robust estimations ([Fin08] Chap. 9). Further parameter reduction can be achieved when using only diagonal instead of full covariances (also compare Section 3.4). This way the covariance matrix describes only the scatter within the individual feature vector components, not between them. In practice, the data can still be modeled sufficiently due to the use of many elements composing the mixture.
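A small numerical sketch of the semi-continuous output density from Equation 2.10: one shared set of Gaussians and state-specific mixture weights. Diagonal covariances are assumed, and all numbers are toy values rather than estimated model parameters.

    # Sketch: semi-continuous output density b_j(x) with shared Gaussians.
    import numpy as np

    def gaussian_diag(x, mean, var):
        """N(x | mean, diag(var)) for a diagonal covariance."""
        norm = np.prod(2.0 * np.pi * var) ** -0.5
        return norm * np.exp(-0.5 * np.sum((x - mean) ** 2 / var))

    def output_density(x, weights_j, means, variances):
        """b_j(x) = sum_k c_jk * N(x | mu_k, C_k), weights specific to state j."""
        return sum(c * gaussian_diag(x, m, v)
                   for c, m, v in zip(weights_j, means, variances))

    means = np.array([[0.0, 0.0], [3.0, 1.0], [1.0, 4.0]])   # shared codebook
    variances = np.ones_like(means)                          # diagonal covariances
    weights_j = np.array([0.6, 0.3, 0.1])                    # weights of one state j
    print(output_density(np.array([0.5, 0.2]), weights_j, means, variances))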

The last aspect of reducing free parameters has already been introduced in the discussion of transition probabilities. Reduction is achieved by not allowing transitions from one state to any other state, but by constraining them according to the chosen model. Also compare the outlined transitions in Figure 5.

For estimating the HMM parameters, the so-called Baum-Welch algorithm is used. It iteratively refines the model to maximize the probability of generating the samples from the training dataset. An important issue in the iterative procedure is to specify an initial solution. A random initialization is not a good choice because the algorithm converges to local optima that might be far from a globally optimal solution. A common initialization technique is to quantize the data with respect to a given number of mixture components and to estimate their initial parameters based on the clusters obtained (see Section 3.4). For further details refer to ([Fin08] Chap. 9).

Finally, we will address how to decode the model. As mentioned initially, given a sequence of observations, the state sequence that most likely generated that observation sequence must be decoded. Because the HMM is a generative model, this is only possible by probabilistic inference. This way not only the state sequence with the highest, so-called optimal production probability can be discovered, but also an association of the observations from the observation sequence with the states from the optimal state sequence can be derived. Consequently, a segmentation is obtained. A very naive approach is to simply traverse the model using all possible state sequences of a given length. For each point in time, the probability of generating the observation in a particular state with respect to the preceding state is determined. The overall probability for a state sequence generating the observations is then calculated as their product. Let O denote the sequence of observations x_1, ..., x_T, let s denote a specific state sequence of the same length, let a_{0i} = π_i denote the start probability for state s_i with s_0 := 0, and let λ denote the HMM (Equation 2.11).

P(O, s | λ) = ∏_{t=1}^{T} a_{s_{t−1} s_t} · b_{s_t}(x_t)    (2.11)

s* = argmax_{s ∈ S} P(O, s | λ)    (2.12)

In Equation 2.12 the state sequence s* with optimal production probability can then be determined by simply examining all possible state sequences S. However, for an observation sequence of length T and N states to be traversed, the computational complexity is O(T · N^T) ([Fin08] Chap. 5). For practical applications this is intractable; the efficient alternative is sketched below and discussed next.
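The efficient alternative replaces the exhaustive enumeration with a dynamic-programming recursion over locally optimal predecessors. The following minimal sketch works in log space; the transition matrix, start probabilities and per-state emission scores are toy values, and a real system would obtain the emission log-likelihoods from mixture densities as above.

    # Sketch: Viterbi decoding in log space, O(T * N^2).
    import numpy as np

    def viterbi(log_pi, log_A, log_B):
        """log_B[t, j] = log b_j(x_t); returns the optimal state sequence."""
        T, N = log_B.shape
        delta = np.full((T, N), -np.inf)
        psi = np.zeros((T, N), dtype=int)
        delta[0] = log_pi + log_B[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A        # all predecessor choices
            psi[t] = scores.argmax(axis=0)                # locally optimal predecessor
            delta[t] = scores.max(axis=0) + log_B[t]
        states = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):                     # back-tracking
            states.append(int(psi[t, states[-1]]))
        return states[::-1]

    log_pi = np.log([0.9, 0.1])
    log_A = np.log([[0.7, 0.3], [1e-12, 1.0]])            # toy linear topology
    log_B = np.log(np.random.rand(6, 2))                  # toy emission scores
    print(viterbi(log_pi, log_A, log_B))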

Instead, the efficient Viterbi algorithm is used. Rather than evaluating all possible state sequences, for each state only the locally optimal solution for the next state in the sequence is considered. This is sufficient due to the use of first-order HMMs that limit their memory to a single state: It is irrelevant how the process arrived at the current state when deciding which state will be next. Consequently, at each point in time a locally optimal decision must be made among all N states in the model. The globally optimal state sequence can then be inferred by back-tracking the local decisions. This means locally optimal sequences are recombined into a globally optimal solution. The principle goes back to dynamic programming and Bellman's principle of optimality. The computational complexity can thus be reduced to O(T · N²) ([HAJ90] Chap. 5). It is important to note that decoding the globally optimal state sequence is only possible after the complete sequence of feature vectors has been observed. Before that, only statements regarding locally optimal solutions are possible. This is especially a limitation for interactive systems requiring an iterative process. Further details can be found in ([Fin08] Chap. 5).

2.5 conclusion

Offline handwriting recognition methods, especially Hidden Markov Models, are the basis for the later integration of the bag-of-features approach. In particular, observation modeling will be important in this regard. The thoroughly discussed geometrical features will be replaced by a learned feature representation. It is therefore important to understand what those heuristic features can accomplish and where their limitations are. Their hand-crafted design is not necessarily adaptable to other application domains like different scripts; this has to be proven experimentally first. Also, they might be well suited in general, but due to the great variability found in handwritten scripts many variants might not be represented sufficiently. The bag-of-features representations will try to cope with this by statistical estimation instead of expert decisions.


3 BAG-OF-FEATURES IMAGE REPRESENTATIONS

Bag-of-features approaches have widely been used for image retrieval, object detection and classification [OD11]. In this context, retrieval and object detection tasks aim at finding images similar to a given query image or object [SZ03]. Image classification tasks, in contrast, assign a previously known category to the query image [CDF+04].

3.1 bag-of-features

In general, the bag-of-features approach is derived from the bag-of-words method found in textual semantic analysis or text retrieval (cf. e.g. [SZ03, OD11]). For an easier introduction, the principle will be illustrated by a brief discussion of bag-of-words. Afterwards, analogies to applications with images instead of text will be given. A typical bag-of-words retrieval system consists of the following steps:

1. Documents forming the retrieval database are represented by lists of their words.

2. Words are replaced by their stems. For example, "walking" is replaced by "walk".

3. The most frequent and therefore not discriminative words are removed. Words that are likely to occur in many documents are, for example, "the", "a" or "and".

4. Now, each document can be represented by its stem counts. The different stems occurring in the database form the vocabulary.

5. In order to finally retrieve documents from the database, a stem vector has to be produced from a given query. By using efficient indexing methods, the documents having the most similar stem vector representations are returned. Similarity has to be measured by a suitable distance function. If the query is a document itself, the prior steps simply need to be applied; in this case the stem vocabulary that was created previously is used.

Note that the representation does not contain any information about the order of words and is therefore called bag-of-words. In order to generalize this concept to images, first an analogy to the formerly used vocabulary must be defined. We will refer to such a vocabulary as visual vocabulary. A visual vocabulary is usually created from local image features (see Section 3.2) [OD11]. These features are mainly characterized by a spatial location within the image and a description of their local neighborhood.

Figure 7: For bag-of-features image representations, a visual vocabulary needs to be created from local image descriptors (upper part: extract local image descriptors from an image database and cluster them into the visual vocabulary). Afterwards, images can be described by the number of occurrences of visual words from this vocabulary (lower part: assign an image's descriptors to their nearest visual words and generate the bag-of-visual-words term vector). The visual words are considered features in the more general term bag-of-features. Based on image from [OD11].

This so-called descriptor will be used for bag-of-features image representations. Other features covering more global aspects of the image can be used in general but are not common in image retrieval and categorization.

The visual vocabulary can then be obtained from the centroids that result from clustering the local image descriptors. This step corresponds to replacing words by their stems in text retrieval (see step 2): Stems represent related words from the natural language, whereas representatives for vector-valued descriptors need to be determined by clustering techniques. An important parameter is the number of these visual words in the vocabulary. A vocabulary which is too big does not generalize well enough in the recognition process because the visual words represent non-discriminative details. If, on the other hand, the vocabulary is too small, important aspects might get lost: The representation will not be discriminative enough. In order to finally represent the image, its descriptors are quantized with respect to the previously created visual vocabulary (see Section 3.4).

The bag-of-features statistic, also referred to as term vector, then reflects the number of occurrences of each visual word or feature. Analogously to the bag-of-words method, the representation does not contain any information about the spatial arrangement of the descriptors. Figure 7 illustrates the process: The upper part shows the visual vocabulary creation; below, the actual representation of an image in terms of bag-of-features is shown.
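The two stages of Figure 7 can be summarized in a few lines. The sketch below uses k-means from scikit-learn as the clustering technique (clustering is discussed in Section 3.4) and random vectors in place of real local descriptors; the vocabulary size and all other values are illustrative.

    # Sketch: build a visual vocabulary and quantize descriptors into a term vector.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    database_descriptors = rng.normal(size=(2000, 128))    # stand-in local descriptors

    # 1. Build the visual vocabulary by clustering the descriptor set.
    vocabulary = KMeans(n_clusters=50, n_init=10, random_state=0)
    vocabulary.fit(database_descriptors)

    # 2. Quantize the descriptors of one image into a bag-of-features term vector.
    image_descriptors = rng.normal(size=(120, 128))
    words = vocabulary.predict(image_descriptors)          # nearest visual word
    term_vector = np.bincount(words, minlength=50).astype(float)
    term_vector /= term_vector.sum()                       # optional normalization
    print(term_vector.shape)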

In order to emphasize the influence of more discriminant visual words and to reduce the influence of less discriminant visual words, an additional term weighting scheme can be applied (see Section 3.5). Usually, visual words occurring in few images are considered more discriminative than visual words occurring in many images.

For applying bag-of-features image representations (see Section 3.5), two major aspects are discussed in the literature (cf. [SZ03, CDF+04]):

1. Finding similar images according to the bag-of-features representation.

2. Finding the meaning of a particular visual word with respect to a given topic.

The first aspect is mostly found in image retrieval, where images semantically similar to a given query image should be retrieved [SZ03]. Here it is important to define a suitable distance measure (see Section 3.5.2). Additionally, adequate subspace representations can be considered (see Section 3.3). The second aspect is important in image categorization [CDF+04]. The objective is to assign an image to a given category. Therefore, the semantic meaning of each visual word is statistically learned beforehand.

3.2 local image features

In contrast to holistic approaches where images are represented as a whole [TP91], local image features describe interesting image patches. The idea is that not all areas are equally important and that it is more suitable to focus on those locally prominent parts. The process of locating these regions is generally referred to as feature detection (cf. [OD11]) (see Section 3.2.1). Technically, only single image points are detected and the process is therefore also called keypoint (cf. [Low99]) or interest point (cf. [MS04]) detection. Afterwards, a description of the interest point's neighborhood is provided by the feature descriptor (cf. [OD11]) (see Section 3.2.2). That description ought to be discriminative for the underlying patch while at the same time being robust to variability. This means that a patch appearing in one image should be recognizable by its feature description even if it appears differently in another image. Those differences might occur due to changes in illumination in the scene or movement of the camera or object. An object, for example, might have moved or turned and will therefore appear scaled or rotated. Robustness to these transformations will be referred to as invariance. In the application of descriptors to handwritten script, rotation invariance will be used to compensate for local slant (see Chapter 5).

3.2.1 Feature detection

The objective of feature detection is to find image points with neighborhoods that are discriminative for the processed image.

Interest points usually lie near edges and corners because image content is mostly described by these structures. In this section the mathematically well-defined Harris Corners [HS88] and SIFT (Scale Invariant Feature Transform) [Low99, Low04] detectors will be discussed.

Harris Corners

The Harris Corners detector was the basis for developments [MS04] that are used nowadays, for example, in image retrieval [SZ03] and categorization [CDF+04]. With respect to the application to handwritten scripts, the method gives a dense interest point coverage of the pen-stroke. This will be useful when creating a bag-of-features statistic (see Chapter 5). The fundamental idea of the Harris corner detector is to perform a local autocorrelation analysis of the image. Here, Harris and Stephens improve upon Moravec's corner detector [HS88]: Moravec analyzes the amount of change in intensity when shifting a window within the image in 45-degree directions. A high change in one direction indicates an edge and a high change in two directions indicates a corner. The amount of change can be determined by evaluating an error metric (Equation 3.1).

E_{u,v} = ∑_{x,y} w_{x,y} · [I(x + u, y + v) − I(x, y)]²    (3.1)

I denotes the image intensity, (x, y) iterates over the image, (u, v) indicates the shift and w is a two-dimensional box window function that is used to clip the currently processed part of the image. When evaluating the metric for each window position, a corner is detected if the minimum change over all shifts is a local maximum bigger than some threshold. The following three improvements to the approach have been suggested by Harris and Stephens:

1. In order to cover more than 45-degree shifts, I(x + u, y + v) − I(x, y) is approximated by a Taylor expansion.

2. In order to achieve a less noisy response, the box window function is replaced by a two-dimensional Gaussian (Equation 3.2).

   w_{x,y} = e^{−(x² + y²) / 2σ²}    (3.2)

3. In order to reduce the responsiveness to edges, not a large minimum of E but rather an eigenvalue analysis of the autocorrelation function is used as the corner measure. Two large eigenvalues indicate a corner.

For further details refer to [HS88].
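A compact sketch of the corner measure resulting from these improvements: gradient products are smoothed with a Gaussian window and combined into the common determinant-minus-scaled-trace response. Note that this response formulation and the sensitivity constant k = 0.04 are the usual textbook variant rather than the exact eigenvalue analysis from [HS88]; the example image and threshold-free argmax are illustrative only.

    # Sketch: Harris corner response from Gaussian-weighted gradient products.
    import numpy as np
    from scipy import ndimage

    def harris_response(image, sigma=1.5, k=0.04):
        ix = ndimage.sobel(image.astype(float), axis=1)    # horizontal gradient
        iy = ndimage.sobel(image.astype(float), axis=0)    # vertical gradient
        # Gaussian-weighted entries of the local autocorrelation (structure) matrix.
        a = ndimage.gaussian_filter(ix * ix, sigma)
        b = ndimage.gaussian_filter(ix * iy, sigma)
        c = ndimage.gaussian_filter(iy * iy, sigma)
        det = a * c - b * b
        trace = a + c
        return det - k * trace ** 2     # large where both eigenvalues are large

    image = np.zeros((64, 64))
    image[20:44, 20:44] = 1.0           # a square responds at its corners
    resp = harris_response(image)
    print(np.unravel_index(resp.argmax(), resp.shape))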

Figure 8 illustrates applications of the Harris corner detector to an artificial image and an image of handwritten Arabic script.

Figure 8: Examples for the application of the Harris Corners detector. Figure 8a shows an artificial image to demonstrate how the detector responds to corners rather than edges. Figure 8b shows an application to handwritten Arabic script [PMM+02].

The former shall demonstrate the corner detection capabilities with respect to the previously introduced corner measure. The latter shows that a dense interest point coverage of handwritten script is possible, where highly structured regions are covered by more interest points than less structured regions.

SIFT detector

The SIFT detector [Low99] is one of the most widely used interest point detectors [OD11]. Although we will see that for images of handwritten script the resulting interest point coverage is not dense enough, SIFT interest points will be used for a scale estimation of the processed script (see Chapter 5). SIFT interest points are detected in a difference-of-Gaussian scale space representation of the image. This way scale invariance can be achieved, because an object captured at different scales will be represented at different layers in the scale space and thus have similar descriptor representations (see Section 3.2.2). In order to create the scale space representation, the image is first convolved with Gaussian functions of increasing variance σ.

L(x, y, σ) = G(x, y, σ) ∗ I(x, y)    (3.3)

G(x, y, σ) = (1 / 2πσ²) · e^{−(x² + y²) / 2σ²}    (3.4)

Equation 3.3 defines the Gaussian scale representation, where ∗ is the convolution operation, I is the image and the Gaussian G is given in Equation 3.4. Please note the difference to Equation 3.2, where the scaling term 1/2πσ² is omitted due to the usage as a window function. Figure 9a illustrates this step on the left side. Note that for performance reasons the filtered images L can be sub-sampled with each doubling of σ; images having the same dimension are part of the same octave. Up to this point the scale space representation contains much irrelevant information. In order to realize a bandpass that extracts edges at different scales, a difference-of-Gaussian representation is used.

Figure 9: SIFT interest point detection. Figure 9a illustrates the creation of the difference-of-Gaussian scale space showing two octaves. Stacked layers on the left refer to the Gaussian-smoothed images. The arrow indicates the direction of scale growth. On the right the subtraction for the difference-of-Gaussian images is shown by arrows and the minus operation. The two octaves are visualized by differently sized layers. Figure 9b shows the interest point candidate detection in the difference-of-Gaussian scale space by non-extremum-suppression. The cross indicates the currently processed point. If it is a maximum or minimum within the green points, it will be a candidate. Images are taken from [Low04].

For the purpose of interest point detection, this leads to stable image features [Low04]. The difference-of-Gaussian convolution kernel is simply obtained by subtracting two Gaussian functions that differ in σ by the constant multiplicative factor k (Equation 3.5).

D_G(x, y, σ) = G(x, y, kσ) − G(x, y, σ)    (3.5)

Also note that D_G is a close approximation of the Laplacian-of-Gaussian function that was found to be suitable for interest point detection by Mikolajczyk and Schmid [MS04]. The difference-of-Gaussian scale space representation is then given by convolving the image I with the convolution kernels D_G for different scales σ (Equation 3.6).

D(x, y, σ) = D_G(x, y, σ) ∗ I(x, y)    (3.6)

Figure 9a shows the difference-of-Gaussian scale space construction on the right. For efficiency reasons the scale representations D are calculated by subtracting adjacent images in the Gaussian scale space representation. This is equivalent to convolving the image with the corresponding kernel D_G. Now, interest points can be detected. As Figure 9b shows, candidate points are obtained by a three-dimensional non-extremum-suppression in the difference-of-Gaussian representation. For each point the local neighborhood in the same and adjacent scales is examined. Candidate points are discarded if they are located in low-contrast areas or on edges.
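For illustration, the following sketch builds one octave of the difference-of-Gaussian representation from Equations 3.3-3.6 by smoothing with increasing σ and subtracting adjacent levels. The subsequent non-extremum-suppression and candidate filtering are omitted, and all parameter values are illustrative rather than those used by the thesis or by [Low04].

    # Sketch: one octave of the difference-of-Gaussian scale space.
    import numpy as np
    from scipy import ndimage

    def dog_octave(image, sigma0=1.6, k=2 ** 0.5, levels=5):
        blurred = [ndimage.gaussian_filter(image.astype(float), sigma0 * k ** i)
                   for i in range(levels)]                 # Gaussian scale space (Eq. 3.3)
        return [blurred[i + 1] - blurred[i]                # adjacent-level differences
                for i in range(levels - 1)]                # (equivalent to Eq. 3.6)

    image = np.random.rand(128, 128)                       # stand-in image
    dog = dog_octave(image)
    print(len(dog), "difference-of-Gaussian levels")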

Figure 10: Examples for the application of the SIFT detector. The size and direction of the arrows correspond to the scale and orientation of the respective interest point. The results in Figure 10a (artificial image) and Figure 10b (handwritten Arabic word [PMM+02]) are based on the same images as used in Figure 8.

This increases the stability of features. Features are considered stable if they can be found in different scenes showing the same object. So far, interest points have been determined in terms of location in the image and scale. In order to additionally achieve rotation invariance, the interest point's principal orientation is calculated. Therefore, gradient orientations in the local neighborhood are weighted and accumulated into a histogram. The weight of an orientation is determined by its gradient magnitude and a two-dimensional circular Gaussian (see Equation 3.2) centered at the interest point location. The highest peak in the histogram then reflects the main orientation. For higher accuracy, the orientation that will be assigned to the interest point is interpolated over the adjacent histogram bins. If multiple high peaks can be detected in the histogram, multiple interest points at the same location and scale but with different orientations will be generated. For further details refer to [Low99, Low04].

Figure 10 shows an application of the SIFT detector to an artificial image and to an image of a handwritten Arabic word. The results are based on the same images as shown in Figure 8. Figure 10a illustrates how SIFT interest points lie close to edges but not on them and how homogeneous areas are avoided. Figure 10b shows how the pen-stroke in images of handwritten script is sparsely covered with interest points. In richly textured images, in contrast, many more interest points will be detected. Additionally, both images show how multiple interest points are created at the same location.

3.2.2 Feature descriptors

The objective of the feature descriptor is to provide a discriminative representation of a local region within an image. Therefore, a mapping from image space to a vector space is defined. For feature descriptors this space is referred to as descriptor space. The simplest methods are directly based on image intensities or their eigenspace representations. More sophisticated methods are based on image gradients (cf. [OD11]). An important property is the invariance of the descriptor with respect to some transformations in image space.

This means that local patches might undergo these transformations and have very different image representations while, on the other hand, having similar representations in descriptor space. In many applications this is desirable because semantically equal objects appearing differently can still be recognized. Examples for those transformations are changes in illumination, scaling, rotation [Low99] (see Section 3.2.2) or even affine transformations in general [MS04]. For further improvement, descriptor subspace representations are being discussed in the literature [KS04]. In the following section the SIFT descriptor [Low04] will be reviewed as an example for a very popular gradient-based method. For a brief overview of the field and further references refer to [OD11].

SIFT descriptor

The SIFT interest point descriptor [Low04] is invariant to transformations in image space like changes in illumination, scale and rotation. Illumination changes can be compensated by the usage of histograms of gradient orientations. These will be computed relative to the estimated interest point scale and orientation. Therefore, the descriptor first divides the interest point's neighborhood at the respective scale into sub-regions (see Figure 11) that are arranged relative to the respective orientation. Also all gradients are rotated relative to that orientation. Gradient magnitudes are now weighted with a two-dimensional Gaussian (see Equation 3.2) that is centered over the descriptor. Finally, for each sub-region the weighted gradient magnitudes are accumulated with respect to eight histogram orientation bins. This way, for 4 × 4 sub-regions a 128-dimensional descriptor vector is created.

Figure 11: On the left the gradient computation within 2 × 2 sub-regions is illustrated. The Gaussian window (see Equation 3.2) used for gradient magnitude weighting is indicated by the blue circle. On the right the sub-histogram creation is shown. These sub-histograms are visualized by length and direction of the arrows within the sub-regions. Note that this example uses a 2 × 2 grid of eight-bin histograms, thus forming a 32-dimensional descriptor vector. Image taken from [Low04].
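A simplified Python sketch of this descriptor computation is given below: Gaussian-weighted gradient magnitudes are accumulated into orientation histograms of 4 × 4 sub-regions, yielding a 128-dimensional vector. The trilinear interpolation and the exact normalization steps of [Low04] are omitted; the patch size and the Gaussian width are illustrative assumptions.

import numpy as np

def sift_like_descriptor(magnitude, orientation, n_cells=4, n_bins=8):
    """magnitude/orientation: square arrays for the patch around an interest
    point; orientation is assumed to be given relative to the interest
    point's main orientation, in radians."""
    size = magnitude.shape[0]
    ys, xs = np.mgrid[0:size, 0:size]
    center = (size - 1) / 2.0
    # circular Gaussian weighting centered over the descriptor (cf. Eq. 3.2)
    weight = np.exp(-((xs - center) ** 2 + (ys - center) ** 2) / (2 * (0.5 * size) ** 2))
    w_mag = magnitude * weight
    cell = size // n_cells
    bins = ((orientation % (2 * np.pi)) / (2 * np.pi) * n_bins).astype(int) % n_bins
    desc = np.zeros((n_cells, n_cells, n_bins))
    for cy in range(n_cells):
        for cx in range(n_cells):
            sl = (slice(cy * cell, (cy + 1) * cell), slice(cx * cell, (cx + 1) * cell))
            # accumulate weighted magnitudes into the cell's orientation bins
            np.add.at(desc[cy, cx], bins[sl].ravel(), w_mag[sl].ravel())
    desc = desc.ravel()
    return desc / (np.linalg.norm(desc) + 1e-12)    # 128-dimensional for 4x4x8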

3.3 principal component analysis

Principal component analysis (PCA) (cf. [DHS01] Sec. 3.8) can be used for decorrelating data and reducing its dimensionality. In the context of learning bag-of-features representations it will be of importance when decorrelating SIFT descriptors (also see Section 3.4 and Chapter 5). Dimensionality reduction is essential when using bag-of-features statistics as feature representation for a Hidden Markov Model (see Chapter 5).

The method is based on analyzing the variance of vector components and the correlation between them in a sample set $x_1, \ldots, x_N$. The data is transformed to a coordinate system in which all these correlations vanish. Its components are referred to as principal components. The total scatter matrix $S_T$ (Equation 3.7) can be used to describe these variations in the sample set:

$S_T = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T$   (3.7)

$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$   (3.8)

A sample $x$ is a column vector and $x^T$ its transpose. $\bar{x}$ is the sample mean (Equation 3.8). If $x_1, \ldots, x_N$ is drawn from a single distribution, $\bar{x}$ is an estimate of its expected value and $S_T$ is an estimate of its covariance. Principal components are derived by an eigenvalue analysis of $S_T$ such that

$S_T \phi_i = \phi_i \lambda_i$,   (3.9)

where $\phi_i$ is the i-th eigenvector and $\lambda_i$ the corresponding i-th eigenvalue. $S_T$ is symmetric, thus eigenvectors and eigenvalues, as formalized in Equation 3.9, exist. When transforming the mean-free samples with the transformation matrix $\Phi$ (see Equation 3.10),

$y = \Phi^T (x - \bar{x})$,   (3.10)

the data is projected to a new coordinate system that will be centered at the transformed data mean and is oriented such that it reflects the variations of and between vector components in the original dataset. The correlation among the new coordinates vanishes. $\Phi$ is constructed from column vectors $\phi_1, \ldots, \phi_n$ that correspond to the n principal components and span the new coordinate system. Note that in Equation 3.10 an orthonormal transformation is performed and consequently $\Phi$ is an orthonormal basis. For that reason also the Euclidean distances between transformed samples are the same as between their corresponding originals (cf. [Fin08] Chap. 9).

In order to perform an additional dimensionality reduction, the transformation matrix $\Phi$ (Equation 3.10) has to be modified. It is exploited that the eigenvalues computed from $S_T$ reflect the amount of variance within the components described by the corresponding eigenvectors.

Therefore, the column vectors $\phi_1, \ldots, \phi_n$ have to be rearranged such that the corresponding eigenvalues are ordered by decreasing values. Only the eigenvectors of the largest m eigenvalues are used to construct $\Phi$. This way the dimensionality of transformed samples is reduced from n to m < n.

Figure 12: Example of decorrelation and dimensionality reduction with principal component analysis ((a) 3D point cloud, (b) point cloud after projection). Figure 12a shows a three-dimensional point cloud. Colors indicate values of the axis that is oriented vertically. Figure 12b shows the decorrelated two-dimensional points after principal component analysis.

Figure 12 illustrates the process of decorrelation and dimensionality reduction. In Figure 12a a three-dimensional point cloud is shown. The points are highly correlated with respect to the three dimensions. After applying the principal component analysis (see Figure 12b) the points are decorrelated and their dimension is reduced. The point cloud only varies along the components of its coordinate system. At the same time the relation among points is preserved.
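A minimal Python sketch of the transformation described by Equations 3.7 to 3.10, including the reduction to the m leading principal components, is given below. Function and variable names are illustrative.

import numpy as np

def pca_transform(samples, m):
    """samples: N x n matrix with one sample per row; returns N x m projections."""
    mean = samples.mean(axis=0)                        # sample mean (Equation 3.8)
    centered = samples - mean
    scatter = centered.T @ centered / len(samples)     # total scatter S_T (Equation 3.7)
    eigvals, eigvecs = np.linalg.eigh(scatter)         # S_T is symmetric (Equation 3.9)
    order = np.argsort(eigvals)[::-1]                  # decreasing eigenvalues
    phi = eigvecs[:, order[:m]]                        # keep the m principal components
    return centered @ phi                              # y = Phi^T (x - mean), Equation 3.10

# usage: 128-dimensional SIFT descriptors reduced to, e.g., 30 components
# reduced = pca_transform(np.asarray(descriptors), m=30)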

3.4 clustering and quantization

Clustering refers to the process of finding a fixed number of representatives for a set of vectors. Assigning the representatives to these vectors is referred to as quantization. Therefore, the term vector quantizer is commonly used in the literature for clustering algorithms (cf. e.g. [GG92, Fin08]). The set of representatives is referred to as codebook. A practical application can be found in the digitization of analog signals. After sampling, continuous values have to be saved in a digital representation. One possibility would be to quantize them with respect to their closest integral number within a certain range. In general, an important choice is the number of representatives or codebook entries. The bigger the codebook, the more accurate the representation, at the cost of being more space consuming. On the other hand, it might be desirable for some applications to achieve abstraction from the original data. This can be realized with smaller codebooks, although it remains application dependent to find the correct size. This aspect is of importance when quantizing SIFT descriptors for bag-of-features representations. The quantization process can therefore be thought of as an approximation. The resulting inaccuracy can be measured by the average quantization error (Equation 3.11) ([GG92] Sec. 10.3):

$\epsilon(Q) = \int_{\mathbb{R}^m} d(x, Q(x))\, p(x)\, dx$   (3.11)

In Equation 3.11, $Q : \mathbb{R}^m \rightarrow Y$ is the quantization function mapping from the m-dimensional vector space to an element of the codebook Y. $d(\cdot, \cdot)$ denotes a distance measure between two vectors, for example the Euclidean distance. For computing the statistical average, the integral covers weighted distances with respect to the densities p(x) of elements distributed in $\mathbb{R}^m$. Vector quantizer design is based on two conditions that are closely related:

1. Nearest-Neighbor Condition: Specifies the optimal quantization of vectors with respect to a given codebook: A vector is represented by the codebook entry with minimal, for example Euclidean, distance.

2. Centroid Condition: Specifies the optimal choice of the codebook for a given partition of the vector space: Codebook entries are given as the centroids of the corresponding partitions. The centroid minimizes the average distance to its elements.

The two most common vector quantization algorithms have both come to be known as the k-means algorithm. The original k-means was developed by MacQueen (1967). A different approach is the so-called Lloyd's algorithm that is often incorrectly referred to as k-means (cf. [Fin08] Chap. 4). Lloyd's algorithm directly adapts the two previously discussed conditions for optimal vector quantizers. In an iterative procedure, first all vectors are assigned to their closest codebook entry. In the next step, based on the new partition, new centroids for an updated codebook are computed. The procedure is iterated until the quantization error (Equation 3.11) converges. It is important to note that an initial codebook must be given and the algorithm might converge to a local optimum. It is therefore common to make several runs with different random initializations; a minimal sketch of the procedure is given below. Figure 13 demonstrates a clustering result that can be obtained with Lloyd's algorithm. Points randomly drawn from three different Gaussian distributions are partitioned into three clusters. Figure 13b shows that the sample scatter generated by the Gaussians cannot be modeled sufficiently. Due to the usage of Euclidean distances, points get assigned in a circular shape around their closest centroid.
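The following sketch alternates between the nearest-neighbor condition and the centroid condition until the codebook no longer changes. The random initialization and the stopping criterion are simplifying assumptions; in practice several runs with different initializations are advisable.

import numpy as np

def lloyd(samples, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    codebook = samples[rng.choice(len(samples), k, replace=False)]
    labels = np.zeros(len(samples), dtype=int)
    for _ in range(iterations):
        # nearest-neighbor condition: assign every sample to its closest entry
        dists = np.linalg.norm(samples[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # centroid condition: recompute each codebook entry from its partition
        new_codebook = np.array([samples[labels == i].mean(axis=0)
                                 if np.any(labels == i) else codebook[i]
                                 for i in range(k)])
        if np.allclose(new_codebook, codebook):
            break
        codebook = new_codebook
    return codebook, labels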

Figure 13: Example for clustering based on Lloyd's algorithm ((a) original data, (b) Lloyd's algorithm). Figure 13a shows an artificial point cloud generated from three Gaussian distributions as indicated by the colors. For Lloyd's algorithm the only prior information available is the number of clusters that should be created. Figure 13b shows the result. Black squares indicate centroids, the colors the quantization of samples.

In order to quantize large datasets with respect to many codebook entries, the k-means algorithm after MacQueen can be more suitable. For $N \rightarrow \infty$ samples the method is able to compute an optimal codebook while iterating over the data only once. For finite sample sets found in practice, therefore, only approximately optimal solutions are possible. Lloyd's algorithm, in contrast, iterates over all vectors after each update of the codebook. The following scheme gives an overview of the algorithm (cf. [Fin08] Chap. 4):

1. Initialization: Given the sample set $x_1, \ldots, x_N$ and the number of clusters k, let the initial codebook be $Y^l = \{x_1, \ldots, x_k\}$ with iteration counter l. For a better distribution of codebook entries over the dataset, an initial codebook entry is only accepted if it has a sufficient distance from the last accepted entry.

2. Nearest-neighbor assignment: The next sample $x_t$ with t > k is assigned to the nearest neighbor from the current codebook $Y^l$ and becomes part of its associated sample subset.

3. Codebook update: The codebook entry selected in the last step is updated. Therefore, the new centroid of the previously modified sample subset is determined. The centroid minimizes the average distance to all elements from the subset.

Steps 2 and 3 are iterated until all samples have been processed. It is important that the algorithm works only with a sufficient amount of samples that are statistically independent of each other.

For applications in bag-of-features representations, where descriptors from big image databases have to be clustered, especially the first assumption is fulfilled. Due to the usage of Euclidean distances in the clustering process, no distortions that may be present in the vector space can be modeled. With a given sample set these distortions reveal themselves in the local scatter of the data. Mixture density models are capable of capturing these characteristics. Here only a method is discussed that can easily be derived from the clustering result. This can then be used as an initial solution for the EM algorithm that is typically used for further refinement (cf. [Fin08] Chap. 4).

Different components in the mixture are usually modeled by Gaussian distributions and are therefore defined by an expected value µ and a covariance matrix C. The number of mixture components k and component weights c are needed for integration into a mixture model (Equation 3.12):

$p(x \mid \theta) = \sum_{i=1}^{k} c_i\, \mathcal{N}(x \mid \mu_i, C_i)$   (3.12)

For a definition of the multivariate Gaussian distribution see Equation 2.8. Note that for samples drawn from a single distribution, their mean vector is an estimate for µ and the total scatter matrix (Equation 3.7) is equivalent to an estimate of C. In order to determine the mixture model, all parameters θ can be directly or indirectly derived from clustering results:

The number of mixture components is equivalent to the number of clusters.

The mixture component weights are given as the a-priori probabilities $P(R_i) = |R_i| / N$ for the respective cluster $R_i$, with N denoting the number of samples.

The mean vectors are given as the centroids.

The covariance matrices can be estimated from the different clusters (see Equation 3.7).

Note that for robustly estimating covariance matrices a large number of samples must be given. This is especially important for high dimensional data. In practice, it is therefore more convenient to estimate diagonal covariance matrices that describe the scatter only along the components of the coordinate system.

For quantizing samples based on the mixture model, simply the densities of all mixture components are evaluated with respect to a sample. In case of hard quantization the sample is then assigned

to the mixture component λ with maximum weighted density (Equation 3.13):

$\lambda = \operatorname*{argmax}_{i \in \{1 \ldots k\}} c_i\, \mathcal{N}(x \mid \mu_i, C_i)$   (3.13)

This is equivalent to deciding for the component with maximum a-posteriori probability. Also compare the Bayes classifier discussed in Section 3.5.1. In case of soft quantization no definite decision is made; instead, the affiliations are described by the weighted densities of all components.

Figure 14: Example for clustering based on mixture density models ((a) hard quantization, (b) soft quantization). Colors indicate the samples' affiliation to mixture components. Black squares show the mean vectors, covariance matrices are indicated by the ellipses around them. Note that the ellipses only vary horizontally and vertically, due to the usage of diagonal covariances. Figure 14a shows an example of hard quantization, Figure 14b an example of soft quantization. The latter is indicated by the smooth color transitions between the mixture components.

Figure 14 illustrates the principle of quantization with mixture models. Note the difference to the clustering results from Lloyd's algorithm (see Figure 13). These are especially apparent with the horizontally oriented cluster on top.
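The following sketch shows how a mixture model can be derived from a clustering result and then used for hard and soft quantization (Equations 3.12 and 3.13). Diagonal covariances are used, as discussed above; the small regularization constants are implementation assumptions.

import numpy as np

def mixture_from_clusters(samples, labels, k):
    weights = np.array([(labels == i).mean() for i in range(k)])              # P(R_i)
    means = np.array([samples[labels == i].mean(axis=0) for i in range(k)])
    variances = np.array([samples[labels == i].var(axis=0) + 1e-6 for i in range(k)])
    return weights, means, variances

def quantize(x, weights, means, variances, soft=False):
    # weighted diagonal-Gaussian log-densities log(c_i * N(x | mu_i, C_i))
    log_dens = (np.log(weights + 1e-12)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    if soft:
        log_dens -= log_dens.max()
        p = np.exp(log_dens)
        return p / p.sum()              # soft assignment: normalized affiliations
    return int(np.argmax(log_dens))     # hard assignment (Equation 3.13)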

3.5 applications

After discussing the fundamentals of bag-of-features image representations, the two most common applications, image categorization (see Section 3.5.1) and image retrieval (see Section 3.5.2), will be reviewed in this section. In comparison, two different bag-of-features interpretations are presented. For image categorization the feature probabilities with respect to a category are estimated. This way the category a given image most likely belongs to can be determined. In image retrieval, in contrast, entities are represented by their bag-of-features statistic. A distance metric is used to detect similarities. For a given query image the most similar images in a database are retrieved in order of relevance.

Figure 15: Illustration of image categorization. On the left, examples for visual categories (faces, buildings, trees, cars, phones, bikes, books) from the training dataset are shown. These can be used to train a statistical classifier whose application is outlined on the right: the query image is classified with respect to one of the known categories. Images taken from [CDF+04].

3.5.1 Image categorization

Image categorization refers to the supervised classification of images. In a prior training step, typical representations for images of a joint category are learned. For that reason the images in the training dataset must be labeled. Especially for big datasets, which are necessary for good categorization results, this aspect is of importance. Figure 15 illustrates the concept. The examples of training images already show that the appearance of objects and scenes within categories is highly diverging. It is therefore important that the sample set is representative. The formerly unknown query images are only assignable to the correct category by taking prior knowledge into account. This knowledge is unavailable to the classifier if images similar to the query are not part of the training database. A bearded alpinist wearing sun glasses and a helmet, for instance, has a completely different visual appearance than the faces shown in Figure 15, although the image belongs to the same category.

In order to introduce the principle of image classification using bag-of-features, the training and application of a Bayes classifier [CDF+04] will be explained. In direct comparison the approach is outperformed by the more sophisticated Support Vector Machines (SVMs) [CDF+04]. Therefore, a short overview of state-of-the-art methods for image categorization will be given first. An SVM divides the feature space by a hyperplane based on training samples belonging to two classes. The classification of new samples depends on which side of the hyperplane the samples are found in feature space. Training samples closest to the hyperplane, and thus defining it, are called support vectors. Multi-class problems can be decided by training multiple SVMs.

Other advanced techniques include the incorporation of spatial interest point locations in the image [LSP06] and probabilistic latent semantic analysis [BZM06]. The former applies bag-of-features to sub-regions in a spatial pyramid of the image. Loose collections of features are not used to represent whole images anymore but can be spatialized with respect to size and location of the sub-region. The method is built upon an SVM used for classification. In the latter approach a statistical model is learned to discover latent topics that correspond to categories. In particular, relations between visual words can be found that are discriminative for these topics. In [BZM06] a nearest neighbor classifier is used based on vectors of topic probabilities. These are evaluated for training images from different categories. A query image is assigned to the category of the sample whose topic vector is most similar to the query image's topic vector. For being more robust towards outliers, also k > 1 nearest neighbors can be evaluated. The category assignment is then done by majority voting.

For the Bayesian classifier the visual word probabilities for each category have to be estimated during training. Afterwards, they will be used to compute the maximum a-posteriori probability of the categories with respect to the given bag-of-features statistic. Let $\{C_1 \ldots C_K\}$ denote the set of categories and $V = \{w_1 \ldots w_{\nu}\}$ the visual vocabulary consisting of $|V|$ visual words. $n_{t,j}$ is the number of occurrences of visual word $w_t$ in the bag-of-features statistic $I_j$ of the corresponding image. Equation 3.14 shows the application of Bayes' theorem. The a-posteriori probability $P(C_i \mid I_j)$ for a category given an image representation can be expressed in terms of the category a-priori probability $P(C_i)$ and the probability of representation $I_j$ given the category $C_i$. Because the objective is to determine the category $C_{\lambda}$ with maximum a-posteriori probability, the term $P(I_j)$ can be neglected. It has no influence on the maximization (Equation 3.15) (cf. e.g. [Fin08] Chap. 3).

$P(C_i \mid I_j) = \frac{P(C_i)\, P(I_j \mid C_i)}{P(I_j)}$   (3.14)

$\lambda = \operatorname*{argmax}_{i \in \{1 \ldots K\}} P(C_i)\, P(I_j \mid C_i)$   (3.15)

In order to determine the category index λ, $P(I_j \mid C_i)$ has to be formulated with respect to the visual word counts found in bag-of-features representations (Equation 3.16) [CDF+04]. $P(C_i)$ can directly be estimated from the training sample set.

$P(C_i)\, P(I_j \mid C_i) = P(C_i) \prod_{t=1}^{|V|} P(w_t \mid C_i)^{n_{t,j}}$   (3.16)

The product incorporates conditional visual word probabilities $P(w_t \mid C_i)$ which are weighted by the visual word counts $n_{t,j}$ in the exponent. When evaluating categories for visual word occurrences in a single image,

different a-posteriori probabilities are obtained for different probabilities $P(w_t \mid C_i)$. Their influence depends on the frequencies of the visual words $w_t$ in the bag-of-features representation.

In order to finally evaluate Equation 3.16, the probabilities $P(w_t \mid C_i)$ have to be estimated from the training sample set. This is achieved by simply accumulating the number of occurrences of the respective visual word in all images of the given category. In order to obtain a probability, the value is normalized with the sum of counts of all visual words observed in images of the category (Equation 3.17):

$P(w_t \mid C_i) = \frac{1 + \sum_{I_j \in C_i} n_{t,j}}{|V| + \sum_{s=1}^{|V|} \sum_{I_j \in C_i} n_{s,j}}$   (3.17)

Due to the product in Equation 3.16, zero estimates should be avoided. In Equation 3.17 this is achieved by Laplace smoothing (cf. [CDF+04]). By adding one in the numerator and $|V|$ in the denominator, visual words that were not observed in the training images of a category still receive a small non-zero probability. It is important to note that Equation 3.16 is only defined for statistically independent occurrences of the visual words $w_1 \ldots w_{\nu}$. In practice, this assumption is highly unrealistic. In these cases the Bayes classifier is therefore referred to as the naïve Bayes classifier.

The performance of image categorization systems is simply measured by the percentage of correctly classified images. Further analysis is possible with a confusion matrix where the classification results of samples from one category are analyzed with respect to assignments to all categories. This way it is easily possible to detect confusions between classes. Note that the number of categories considered in the literature is quite low. In [BZM06] the method is evaluated on datasets consisting of four, eight and 13 different categories. In [CDF+04] seven categories are used.
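A minimal Python sketch of this naïve Bayes categorization is given below: training estimates $P(w_t \mid C_i)$ with Laplace smoothing (Equation 3.17), and classification evaluates Equation 3.15 in the log domain to avoid numerical underflow. Data layout and names are assumptions for illustration.

import numpy as np

def train_bayes(bof_per_category):
    """bof_per_category: list with one (num_images x |V|) count matrix per category."""
    priors, word_probs = [], []
    total_images = sum(len(c) for c in bof_per_category)
    for counts in bof_per_category:
        counts = np.asarray(counts)
        priors.append(len(counts) / total_images)                  # P(C_i)
        acc = counts.sum(axis=0)                                   # counts summed over images
        word_probs.append((1 + acc) / (len(acc) + acc.sum()))      # Equation 3.17
    return np.array(priors), np.array(word_probs)

def classify(bof, priors, word_probs):
    """bof: visual word counts n_{t,j} of the query image."""
    scores = np.log(priors) + (np.asarray(bof) * np.log(word_probs)).sum(axis=1)
    return int(np.argmax(scores))                                  # Equation 3.15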

3.5.2 Image retrieval

Image retrieval refers to searching for images in a database where the search query is an image itself. This has to be differentiated from image search that uses symbolical concepts as queries [OD11]. Such a query, for example, could be the concept "sky". In Figure 16 an example for image retrieval is given. The objective is to search for the white graffiti in movie frames. The query is selected in one scene and recognized if it appears again. In comparison to the previously discussed image categorization (see Section 3.5.1), no supervised training is necessary. This also means that no labeling of the data with respect to classes or categories is needed. Instead, bag-of-features retrieval works in an unsupervised manner where images in the database are indexed with respect to their term vector representation (see Section 3.1). For a query image the most similar items from the database are returned and ranked accordingly. Similarity is measured by a distance metric computed between term vectors. Figure 17 explains this concept schematically.

Figure 16: Example for image retrieval in the movie "Run Lola Run". The upper two images illustrate the query selection. The yellow box indicates a region within the scene that should be retrieved. On the right the resulting query image is shown. The lower two images illustrate how the graffiti is found in a different scene. Blue lines show correspondences between the descriptors calculated for the bag-of-features representations. Images taken from [SZ03].

In order to efficiently perform distance calculations over thousands of images in the database, an inverted file structure is used. Therefore, an index is created having one entry per visual word. For each entry a list of database images is saved where each image must contain the respective term. For a query term vector, all images for all term occurrences are evaluated using the index. On these elements the ranking is performed with an according distance measure. The method is fast because it takes advantage of the sparseness of term vector representations. In [SZ03] around 10,000 terms form the visual vocabulary. Given that the variance of scenes and different objects within these is very high, also the visual words in the vocabulary are highly diverging. It is therefore unlikely to find a high percentage of different terms in one image. Additionally, an upper bound for the population of term vectors can be estimated. In [SZ03] images contain 1,600 interest points on average. With their vocabulary this leads to an estimated upper bound of 16% non-zero entries in the term vector. It can be concluded that by using the inverted file structure a small subset of images can be selected for further processing in a fast way.
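The following sketch illustrates such an inverted file structure: for every visual word, the index stores the images containing it, so that only a small candidate set has to be ranked for a query. Data layout and names are assumptions for illustration.

from collections import defaultdict

def build_inverted_index(term_vectors):
    """term_vectors: dict image_id -> dict {visual_word: weight} (sparse)."""
    index = defaultdict(set)
    for image_id, terms in term_vectors.items():
        for word in terms:
            index[word].add(image_id)
    return index

def candidate_images(query_terms, index):
    candidates = set()
    for word in query_terms:
        candidates |= index.get(word, set())
    return candidates   # only these need a full distance computation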

Figure 17: Schematic illustration of image retrieval. Images are represented by bag-of-features term vectors. In an inverted file structure similarities between these term vectors can be found efficiently. This is outlined by the box framing the distance calculation. The output is then given by a list of images ranked by increasing distance between the term vector representations of the query and the database entries.

As already mentioned, a distance metric is needed in order to measure similarity between term vectors. A popular choice is the Euclidean distance (cf. [OD11]), but other metrics are possible, too. In [SZ03] the cosine of term vector angles is used. Let s and t be two term vectors given in column order and ϕ the angle between them. The respective distance function $d_{\cos}$ is given in Equation 3.18.

$d_{\cos}(s, t) = 1 - \cos(\varphi)$   (3.18)

$\cos(\varphi) = \frac{s^T t}{\|s\|\, \|t\|}$   (3.19)

The cosine of an angle enclosed by two vectors is simply the normalized scalar product between them (Equation 3.19), where $\|\cdot\|$ denotes the Euclidean norm. Because

$\cos(\varphi) \in [-1, 1] \;\Rightarrow\; 1 - \cos(\varphi) \in [0, 2]$,   (3.20)

$d_{\cos} = 0$ holds for vectors pointing in the same direction and $d_{\cos} = 2$ for vectors pointing in opposite directions. This way the distance metric takes the distribution of terms into account rather than the differences between individual components. This is suitable because the term vector, or bag-of-features representation, can also be interpreted as a probability distribution.
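A short sketch of the cosine distance of Equations 3.18 and 3.19, applied to ranking the candidate images obtained from the inverted file, could look as follows.

import numpy as np

def cosine_distance(s, t):
    return 1.0 - float(s @ t) / (np.linalg.norm(s) * np.linalg.norm(t) + 1e-12)

def rank(query_vec, database_vecs, candidates):
    """database_vecs: dict image_id -> dense term vector (np.ndarray)."""
    scored = [(cosine_distance(query_vec, database_vecs[i]), i) for i in candidates]
    return [image_id for _, image_id in sorted(scored)]   # increasing distance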

Instead of representing bag-of-features by the number of visual word occurrences, it is more convenient to use relative measures. In term weighting schemes this is done by using visual word frequencies rather than their raw counts. This is especially important in order to be robust towards images with a different overall number of interest points. Additionally, terms occurring in many images are not as discriminative as terms occurring only in few images. Consequently, for measuring similarity, different terms are not equally important in the distance calculation. Term weighting is therefore a standard procedure in information retrieval (cf. e.g. [BR99, SM86]). The term-frequency inverse-document-frequency (TF-IDF) weighting scheme is widely used. It takes into account how often particular visual words appear in all images from the database and down-weights them accordingly. If $(t_1, \ldots, t_i, \ldots, t_{\nu})$ is a term vector, then its components can be defined according to the TF-IDF weighting scheme (Equation 3.21):

$t_i = \frac{n_{i,j}}{n_j} \log \frac{N}{N_i}$   (3.21)

In the equation, $n_{i,j}$ is the number of occurrences of term i in image j and $n_j$ is the overall number of terms in image j. The inverse document frequency is given by the logarithm of the quotient of N, i.e. the overall number of images in the database, and $N_i$, i.e. the number of images in the database that term i occurs in.

Other popular term weighting schemes that have been evaluated in the literature are the binary weighting scheme and the term frequency weighting scheme. Experiments in [SZ03] show that they are outperformed by the TF-IDF scheme. Binary weighting only comprises the simple occurrence of visual words and not their number or frequency. Frequency weighting omits the inverse-document-frequency factor of the previously presented TF-IDF scheme. Additionally, many variants of these methods exist (cf. [BR99]). The concept of stop lists goes beyond TF-IDF weighting. The most and least often occurring terms are eliminated from the visual vocabulary. With respect to inverted file structures this has the effect of term vector representations getting even sparser while only the least relevant information is lost. Very rare visual words are stopped due to their consideration as noise. It is assumed that any relevant scene or object appears often enough that the respective visual words have sufficient counts.
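A minimal sketch of the TF-IDF weighting of Equation 3.21, applied to a matrix of raw visual word counts with one row per image, is given below.

import numpy as np

def tfidf(counts):
    counts = np.asarray(counts, dtype=float)        # n_{i,j}, shape: images x |V|
    n_j = counts.sum(axis=1, keepdims=True)         # total number of terms per image
    tf = counts / np.maximum(n_j, 1)                # term frequency n_{i,j} / n_j
    N = counts.shape[0]                             # number of images in the database
    N_i = np.maximum((counts > 0).sum(axis=0), 1)   # number of images a term occurs in
    idf = np.log(N / N_i)                           # inverse document frequency
    return tf * idf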

In [SZ03] an approach is presented that increases the robustness of the retrieval process. This refers in particular to the retrieval of objects. Their spatial interest point arrangement is relatively prominent in contrast to interest point arrangements found in cluttered scenes or homogeneous image regions. In the bag-of-features representation these spatial interest point configurations are not captured at all. Figure 18 shows an example where the spatial interest point consistency is used to improve the retrieval of the graffiti from Figure 16. Descriptors in the two frames that are represented by the same visual word are considered corresponding. The spatial consistency is then measured by counting how many neighboring interest points in one frame have corresponding neighboring interest points in the second frame. The obtained value denotes the spatial consistency score. Based on this measure retrieved images can be re-ranked.

Figure 18: Example for spatial consistency. The upper two images show corresponding interest points within two frames from the movie "Run Lola Run". Not all matches are consistent with the spatial arrangement of the graffiti. Those that are contribute to the spatial consistency score; in the lower images they are indicated in blue. Images taken from [SZ03].

In order to measure retrieval performance, the average normalized rank is evaluated with respect to the ranks $R_i$ of all images relevant to a given query [MMP02] (Equation 3.22):

$\overline{\mathrm{Rank}} = \frac{1}{N N_R} \left( \sum_{i=1}^{N_R} R_i - \frac{N_R (N_R - 1)}{2} \right)$   (3.22)

N denotes the number of images in the database and $N_R$ the number of relevant images. $\overline{\mathrm{Rank}} = 0$ if all relevant images are returned first and $\overline{\mathrm{Rank}} = 1$ if they are returned last. Values near 0.5 indicate random retrieval. Another common performance measure is the precision at some fixed recall levels. Precision is the percentage of retrieved images that are relevant to the query, and recall is the percentage of relevant images that were retrieved.
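The two evaluation measures can be sketched as follows, assuming the retrieval result is given as a ranked list of image identifiers; the normalized rank follows Equation 3.22 as stated above.

def normalized_rank(ranked_ids, relevant_ids, database_size):
    # ranks R_i (1-based) of the relevant images within the result list
    ranks = [i + 1 for i, img in enumerate(ranked_ids) if img in relevant_ids]
    n_rel = len(relevant_ids)
    return (sum(ranks) - n_rel * (n_rel - 1) / 2) / (database_size * n_rel)

def precision_recall_at(ranked_ids, relevant_ids, cutoff):
    retrieved = set(ranked_ids[:cutoff])
    hits = len(retrieved & set(relevant_ids))
    return hits / cutoff, hits / len(relevant_ids)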

3.6 conclusion

Bag-of-features representations are successfully used in image retrieval and categorization, but it is not immediately clear whether their application to images of handwritten script will be successful. SIFT descriptors usually work on richly textured images, whereas images of handwritten words are more or less binary. Nevertheless, the assumption is that by learning representatives for the visual vocabulary, bag-of-features representations will also be discriminative in handwriting recognition. Additionally, the clustering process can be improved by decorrelating descriptors. This fits the usage of diagonal covariances in the mixture model that is estimated from the obtained clusters. It is then sufficient to model variations within descriptor vector components but not between them. In order to avoid errors in the descriptor quantization, the bag-of-features statistic is built by probabilistic visual word assignments. This way there are no hard but only soft decisions, which describe the uncertainty in the quantization. The application and evaluation of bag-of-features representations in handwriting recognition will be covered in-depth in Chapter 5 and Chapter 6.

4 character and word recognition

Towards the bag-of-features integration in Hidden Markov Model based handwriting recognition, this chapter will review related work. The following sections will cover different aspects in this regard. Section 4.1 discusses approaches to part-based character recognition. Part-based refers to representations consisting of local image descriptors that can be used to classify characters into a small number of categories. This strongly relates to bag-of-features applications in image categorization (see Section 3.5.1). In Section 4.2 approaches for word spotting are presented. Word spotting refers to the retrieval of word images. A common application is the creation of an index in big historic document collections. For the exploration of bag-of-features in handwriting recognition it is important to find discriminative representations. These can be discovered more easily when considering a simpler setting like holistic word recognition. Methods used in word spotting are therefore related, and feature representations are discussed in various directions. Additionally, word spotting is very similar to the image retrieval applications presented in Section 3.5.2. It can rather be considered a specialization. Common retrieval systems need to handle heterogeneous image collections. A word spotting system's design, in contrast, is focused on word images.

4.1 part-based character recognition

Part-based character recognition refers to classifying isolated character images with respect to a small number of categories. The methods reviewed in the following Sections 4.1.1 and 4.1.2 are similar. In both cases local descriptors are calculated and classified individually. A character's category is then determined by a voting scheme over all its associated descriptors. The method presented in Section 4.1.3 is different because descriptors are not localized by interest point detectors; instead, the image is dynamically divided into cells where descriptors are computed.

4.1.1 Digit recognition

In [UL10] a method for isolated handwritten digit classification is proposed. Each digit is represented by local image descriptors that are completely unrelated with respect to their spatial configuration. In this regard the method is very similar to the bag-of-features approaches presented in Chapter 3. For interest point detection and description,

the Speeded-Up Robust Features (SURF) [BETV08] approach is used. As with the SIFT detector (see Section 3.2), blob-like structures are detected. Interest points thus do not lie on edges but in homogeneous image regions. Also the descriptor relies on image edge information. The major difference to SIFT is that SURF focuses more on computational speed.

Figure 19: Example for part-based handwritten digit recognition. Descriptors are indicated by rectangles. Their colors correspond to the recognized digit class. Subsequent rows show results for higher numbers of training samples. Image taken from [UL10].

For recognition an initial training is performed. Interest points are detected in images of the training dataset and their local neighborhoods are represented by descriptors. For each image the corresponding digit category is known, thus all descriptors can be annotated accordingly. The actual recognition of formerly unknown images is divided into two further steps. After obtaining interest points, their descriptors are first classified by the 1-nearest-neighbor rule with respect to the reference descriptors from training. This is referred to as feature-level recognition. In the second step the image is categorized by majority voting over all its feature-level recognition results. This is referred to as character-level recognition. Figure 19 illustrates feature-level recognition. For increasing numbers of training samples, descriptor classification results are indicated by differently colored rectangles. Especially parts reoccurring in different digits are likely to be misclassified. This effect is reduced with a bigger training dataset.

The method has been evaluated with the MNIST dataset of handwritten digits [LC11]. In general, recognition scores on feature-level are quite low (around 50%). Only by using a voting scheme can better results on character-level be achieved, although these are much worse than results published on the MNIST benchmark [LC11]. An important property of local image descriptors is their invariance to rotation and scale. Results presented in [UL10] indicate that these are not suitable for digit recognition. Scale and rotation are therefore fixed as illustrated in Figure 19. Note that images in the MNIST database are given in a low resolution of 28 × 28 pixels. In order to find a sufficient number of interest points with the SURF detector, the images are resampled to a larger size.
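The two recognition stages described above can be sketched as follows: feature-level classification of individual descriptors with the 1-nearest-neighbor rule, followed by character-level majority voting. Names and data layout are illustrative assumptions.

import numpy as np
from collections import Counter

def feature_level(descriptor, train_descriptors, train_labels):
    dists = np.linalg.norm(train_descriptors - descriptor, axis=1)
    return train_labels[int(np.argmin(dists))]            # label of the closest descriptor

def character_level(descriptors, train_descriptors, train_labels):
    votes = [feature_level(d, train_descriptors, train_labels) for d in descriptors]
    return Counter(votes).most_common(1)[0][0]             # majority vote over all descriptors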

Figure 20: Example for local descriptor based character classification ((a) degraded Glagolitic characters, (b) character recognition). Figure 20a shows a part of an ancient document. Especially the lower two characters are highly degraded. In Figure 20b the classification of two characters is illustrated. Areas shaded in gray indicate annotated ground truth. Correctly classified descriptors are shown as green circles, false positives as red rectangles. The majority of descriptors agglomerated around each character is classified correctly. White rectangles indicate interest points that have not been considered in the experiment. Images taken from [DS09].

4.1.2 Degraded character recognition

In [DS09] degraded ancient characters are considered. The usage of local image features is motivated by the bad quality of ancient documents. Typically applied binarization techniques fail if the ink fades. On the other hand, the results tend to get very noisy if the binarization is sensitive to such faint structures. Figure 20a shows an example of characters in an ancient document. The lower two characters are of very low visual quality.

The approach is closely related to the method presented in [UL10] (see Section 4.1.1). The major difference is the use of Support Vector Machines for classifying descriptors. Additionally, the characters are not given in separate images but in images of whole documents. As already introduced in Section 3.5.1, SVMs solve a two-class recognition problem by linearly separating class representatives in feature space. If the two classes are not linearly separable, feature vectors can be non-linearly transformed to a higher dimensional space where a linear separation is possible again. In [DS09] radial basis function (RBF) kernels are used for this purpose.

The prior training is therefore organized as follows. Character regions in document images are labeled accordingly. Given an image, SIFT descriptors are computed at SIFT interest point locations (see Section 3.2) and annotated with respect to the region labels in the document. Next, for each character class an SVM with an RBF kernel is trained. Descriptors from the respective class are separated from the features of all other classes. When processing an unknown image, a class affiliation probability can be estimated for each descriptor by evaluating all SVM classifications. Figure 20 shows an example of ancient characters and classified descriptors. For classifying a character, spatially close descriptors vote according to their classification. In [DS09] these spatial descriptor-character correspondences are determined by k-means clustering (see Section 3.4). The number of cluster centers is estimated from the distribution of interest points in scale space. The approach is motivated by the observation that at coarse scale levels each character is roughly represented by one interest point. Therefore, in contrast to [UL10], descriptors are scale and also rotation invariant. The method is evaluated on a corpus of Glagolitic documents. Within the 10 considered character classes good recognition results could be achieved.

4.1.3 Chinese character recognition

In [ZJDG09] a feature extraction method for isolated Chinese character recognition is proposed. The approach is based on elastic meshes and gradient features. An elastic mesh divides the image into dynamically sized cells from which gradient features are calculated. This formerly explored approach [JW98] suffers from abrupt changes in the feature representation that already occur for small changes in the elastic mesh. For reducing these so-called boundary effects, gradients in the cells are weighted with respect to their position and are also employed in the nearest adjacent cells. This is similar to weighting schemes used in the SIFT descriptor. Depending on interest point locations, these descriptors may also overlap. Due to these similarities the method is referred to as Character-SIFT.

Figure 21a shows the process of calculating Character-SIFT features. After smoothing the image for noise reduction and smoothly distributed gradients, the elastic mesh is computed and a gradient image is obtained. The two steps are carried out independently. For calculating the elastic mesh, the number of cells must be given. Then the grid is estimated in such a way that the cells have an approximately equal character pixel intensity distribution. Typically, white areas in the image are covered by big cells whereas areas containing many pixels of the pen-stroke are covered by smaller cells. In the elastic mesh illustrated in Figure 21a this can be observed by comparing cells at the top and bottom.
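One simple way to compute such a mesh is sketched below: grid lines are placed so that every row and column band contains approximately the same amount of pen-stroke intensity. This is a simplified reading of the idea described in the preceding paragraph, not the exact procedure of [ZJDG09].

import numpy as np

def elastic_mesh(ink, n_cells):
    """ink: 2D array of character pixel intensities (higher = more pen-stroke);
    returns the row and column boundaries of an n_cells x n_cells mesh."""
    def boundaries(profile):
        cum = np.cumsum(profile, dtype=float)
        total = cum[-1] if cum[-1] > 0 else 1.0
        targets = np.linspace(0, total, n_cells + 1)
        return np.searchsorted(cum, targets)      # indices enclosing ~equal ink mass
    rows = boundaries(ink.sum(axis=1))            # horizontal grid lines
    cols = boundaries(ink.sum(axis=0))            # vertical grid lines
    return rows, cols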

Figure 21: Character-SIFT feature extraction ((a) Character-SIFT pipeline, (b) gradient quantization). Figure 21a shows the Character-SIFT feature extraction process. Subsequent steps are connected by arrows. The elastic mesh is indicated by the dynamically spaced red grid. Quantization of image gradients is illustrated by the eight black bins containing directional edge information. In Figure 21b the gradient quantization with respect to these orientations is shown. The red vector denotes the gradient. The magnitudes of its two representatives are determined in a parallelogram manner. They are illustrated by blue vectors. Images taken from [ZJDG09].

Directional gradient images $I_x$ and $I_y$ are obtained by convolution with the Sobel mask $S_x$ capturing horizontal edges and the Sobel mask $S_y$ capturing vertical edges. I denotes the input image (Equation 4.1):

$I_x = I * S_x$ and $I_y = I * S_y$   (4.1)

The Sobel convolution masks are given in Equation 4.2 (cf. [GW02] Chap. 3):

$S_x = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix}$ and $S_y = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}$   (4.2)

The gradient magnitude $g_m$ and gradient direction $g_d$ at each point can then be obtained according to Equations 4.3 and 4.4:

$g_m = \sqrt{I_x^2 + I_y^2}$   (4.3)

$g_d = \arctan \frac{I_y}{I_x}$   (4.4)

As indicated in Figure 21a, the gradient information will now be quantized to eight orientations. In Figure 21b these are outlined by numbers from 0 to 7. The red vector is defined through direction $g_d$ and magnitude $g_m$. It is then mapped to the respective two adjacent orientations in the diagram for quantization. Their magnitudes are determined according to the spanned parallelogram. Each cell in the elastic mesh can now be represented by a histogram of eight orientations. Therefore, quantized gradients are accumulated to their cell and their nearest adjacent cells. They are weighted depending on their distance to the cell's center. Finally, the histograms of all cells are concatenated to build the image's feature representation. For recognizing characters a nearest-neighbor classifier is used. Within the evaluation on frequently used Chinese characters good recognition results are achieved.
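The eight-orientation quantization of Figure 21b can be sketched as follows: each gradient vector is decomposed onto its two adjacent orientation axes (spaced 45 degrees apart) according to the parallelogram rule, and the two resulting magnitudes are accumulated into the orientation histogram. The cell accumulation and distance weighting are omitted here.

import numpy as np

def quantize_gradient(gx, gy, n_orient=8):
    hist = np.zeros(n_orient)
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    step = 2 * np.pi / n_orient
    lower = int(angle // step) % n_orient          # first adjacent orientation
    upper = (lower + 1) % n_orient                 # second adjacent orientation
    u1 = np.array([np.cos(lower * step), np.sin(lower * step)])
    u2 = np.array([np.cos(upper * step), np.sin(upper * step)])
    # parallelogram decomposition: solve [u1 u2] [a, b]^T = (gx, gy)
    a, b = np.linalg.solve(np.stack([u1, u2], axis=1), np.array([gx, gy]))
    hist[lower] += a
    hist[upper] += b
    return hist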

4.2 word spotting

Word spotting is the task of finding word images in huge document collections. In contrast to general handwriting recognition systems, the objective is not to generate a transcription (see Chapter 2) but to find similar instances of a given query word image. Word spotting can therefore be considered a specialization of image retrieval (see Section 3.5.2). A common application is the creation of an index for searching in historic document databases [RM07]. Performing a complete transcription of document images is not suitable for that purpose. State-of-the-art handwriting recognition systems do not work well enough on historic documents due to their often bad visual quality and great variations in handwriting. Additionally, the recognition domain cannot be constrained. By only retrieving visually similar instances instead of exact matches in the transcription, better results can be achieved.

In the following sections two different approaches to word spotting will be presented. In segmentation-based systems (Section 4.2.1), documents are first split into images of different words. Errors in this initial segmentation will result in the failure of subsequent steps. The segmentation-free method discussed in Section 4.2.2, in contrast, represents each document as a list of overlapping patches. Similarity is measured between these patches and the query word image. This way the prior segmentation is avoided. An example for the architecture of a segmentation-based word spotting system is given in [RM07]:

1. The images from the document database are segmented into words.

2. By using a clustering technique, visually similar groups of word images are identified. Similarity is measured between feature representations.

3. Clusters representing words that should be part of the index are labeled manually.

4. The labels are then automatically assigned to all words in the respective cluster.

5. The semi-automatically generated transcription of index word images can be used in a full text search.

Figure 22: Schematic illustration of semi-automatic index generation with word spotting. Word images in the database are shown on the right. Clustering their feature representations identifies similar word images. Clusters are manually labeled afterwards. Finally, all word images showing interesting index words are annotated with their cluster labels. Therefore, the process is referred to as semi-automatic labeling.

The feature representation and distance calculation are most important and subject to the further discussion in this section. The complete process is illustrated in Figure 22. The crucial point is the distance calculation between word images in the clustering process (also compare Section 3.4). The following section will thus focus on feature representations of word images and suitable distance measures.

4.2.1 Segmentation-based word spotting

The methods performed to obtain word images from document images are very similar to the segmentation steps in offline handwriting recognition (see Chapter 2). By analyzing the document layout, text elements are separated from non-text elements. Horizontal projection profiles are then used to segment lines. Within lines, vertical projection profiles can be used to detect words. A different approach for word segmentation used in [RP08] is based on clustering gap distances in lines. Afterwards, standard techniques are applied for normalizing typical variations in handwriting like slant, size or baseline variations. For a discussion of preprocessing and normalization methods refer to Section 2.1. For the word spotting methods discussed in this section, the basic idea of feature extraction is to interpret the word image as a time series.

Elements of this series are obtained from small portions of the image. The method is referred to as the sliding window approach (see Section 2.3). After obtaining the feature vector sequence from word images, a distance measure must be defined. In [RM03a, RM07, RP08] the difference between two time series is determined by dynamic time warping (DTW). In the remainder of this section two methods for feature extraction will be presented. This will be followed by a brief discussion of DTW and HMMs for word recognition.

Feature extraction methods used in [RM03a, RM07] are very similar to the features introduced by [MB00] (also compare the geometrical features introduced in Section 2.3). All features are computed column-wise and are normalized with respect to the word image height. This can be considered a sliding window of width one without overlap. Geometrical features include:

Number of black pixels (i.e. the vertical projection profile).

Position of the topmost black pixel (i.e. upper word profile).

Position of the lowermost black pixel (i.e. lower word profile).

Black-white transitions.

Additionally, the use of some intensity-based feature sets was evaluated in [RM03a]. All images are normalized to the same height and an image transformation is performed. Each image column can then be interpreted as a feature vector. The following convolution kernels were evaluated:

Gaussian kernel.

Horizontal partial Gaussian derivative kernel.

Vertical partial Gaussian derivative kernel.

In their experiments the geometrical features worked best.
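The column-wise geometrical features listed above can be sketched as follows, assuming a binary word image with pen-stroke pixels set to 1; all values are normalized by the image height. The handling of empty columns is an assumption for illustration.

import numpy as np

def geometrical_features(binary_word):
    h, w = binary_word.shape
    features = np.zeros((w, 4))
    for x in range(w):
        col = binary_word[:, x]
        ink = np.flatnonzero(col)
        features[x, 0] = col.sum() / h                          # vertical projection profile
        features[x, 1] = (ink[0] / h) if ink.size else 1.0      # upper word profile
        features[x, 2] = (ink[-1] / h) if ink.size else 0.0     # lower word profile
        features[x, 3] = np.count_nonzero(np.diff(col)) / h     # black-white transitions
    return features          # one 4-dimensional feature vector per image column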

The last feature extraction method for word spotting that will be discussed was presented in [RP08]. It is especially interesting because it is inspired by the SIFT descriptor. Therefore, it is related to the Character-SIFT approach (see Section 4.1.3) and is another example for SIFT-based feature extraction in handwriting. Figure 23 shows an overview of the process. After smoothing the image with a Gaussian convolution kernel, the symmetric gradient is computed. In Equations 4.5 and 4.6, L denotes the Gaussian smoothed image (compare Equation 3.3), and $G_x$ and $G_y$ are the horizontal and vertical symmetric gradient images, respectively.

$G_x = L(x + 1, y) - L(x - 1, y)$   (4.5)

$G_y = L(x, y + 1) - L(x, y - 1)$   (4.6)

Figure 23: SIFT-based sliding window application. Input to the method are segmented word images. After computing gradients on the Gaussian smoothed images, a window is slid in writing direction. Subsequent windows overlap. The single windows contain gradient vector fields that will be accumulated into orientation histograms similar to the SIFT descriptor. The feature vector is obtained by concatenating the histograms. Image taken from [RP08].

The method is very similar to the gradient computation with Sobel convolution masks (compare Equations 4.1 and 4.2). In contrast, the Sobel kernel directly integrates the smoothing operator. Gradient magnitudes and directions can then be obtained analogously to Equations 4.3 and 4.4. The next step is to slide a window over the gradient vector field. Subsequent windows overlap. For feature extraction the window is divided into M × N cells. For each cell the gradients are accumulated with respect to their direction and magnitude. Similar to the Character-SIFT approach (see Section 4.1.3), gradients are quantized to eight orientations (also compare Figure 21b). Experiments in [RP08] show that the SIFT-like sliding window features outperform the geometrical features on the considered dataset.

In order to match words against each other, a distance measure is needed. Here, the dynamic time warping approach will briefly be presented. For recognizing complete words in a statistical model, HMMs are also suitable. The idea of their application in whole word recognition will be outlined. The similarity of two time series of the same length can be measured by accumulating the component-wise difference. For time series of

different length, the same method can be applied by simply scaling or padding the sequences. Similarities in sequences of different length will then not be captured adequately, however, which is not suitable for the variabilities found in handwriting. With DTW these variabilities can be captured non-linearly. Let $X = (x_1, \ldots, x_N)$ and $Y = (y_1, \ldots, y_M)$ denote two time series of different length N and M. $D(i, j)$ is then the distance between the subsequences $X_{1:i}$ and $Y_{1:j}$ with $1 \le i \le N$ and $1 \le j \le M$, respectively. The objective is to find a path through the $N \times M$ matrix induced by D with minimal distance or cost. This way, the non-linear stretching and shrinking of the compared sequences can be modeled. For efficient evaluation the problem is constrained. This regards, for example, matching start and end points and possible movements on the path. These are consistent with the variations found in word images. Assuming no errors in writing, two different handwritten images of the same word will both start and end with the same letter, but especially the character width may vary non-linearly. Following these constraints and using dynamic programming, the path with minimal costs can be found in a manner similar to the computation of the state sequence with optimal production probability in the Viterbi algorithm (see Section 2.4). In Equation 4.7, $D(i, j)$ is thus recursively defined with the initial path value $D(1, 1) = d(x_1, y_1)$:

$D(i, j) = \min \begin{cases} D(i, j-1) \\ D(i-1, j) \\ D(i-1, j-1) \end{cases} + d(x_i, y_j)$   (4.7)

The algorithm makes the locally optimal decision and adds the costs $d(x_i, y_j)$ for the current elements of the sequences. The possible movements on the path are defined through the options in the minimization. Note that the path following the diagonal between $D(1, 1)$ and $D(N, M)$ would correspond to a linear matching of the sequences. Further information on dynamic time warping in word spotting can be found in [RM03b]. For an in-depth treatment refer to [HAJ90].

In [RP08] additionally a Hidden Markov Model is used for evaluating the performance of feature representations. In this context an HMM is estimated for each word using 10 states per character. For recognizing a particular sequence of feature vectors, the production probability of each HMM is evaluated. The query image is thus most similar to the word whose HMM has the highest score. For an overview of HMM-based handwriting recognition refer to Section 2.4.
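Returning to the recursion of Equation 4.7, a minimal Python sketch of the DTW computation for two feature vector sequences is given below; the Euclidean distance is assumed as local cost $d(x_i, y_j)$.

import numpy as np

def dtw_distance(X, Y):
    N, M = len(X), len(Y)
    D = np.full((N, M), np.inf)
    D[0, 0] = np.linalg.norm(X[0] - Y[0])          # D(1, 1) = d(x_1, y_1)
    for i in range(N):
        for j in range(M):
            if i == 0 and j == 0:
                continue
            best = min(D[i, j - 1] if j > 0 else np.inf,                    # D(i, j-1)
                       D[i - 1, j] if i > 0 else np.inf,                    # D(i-1, j)
                       D[i - 1, j - 1] if i > 0 and j > 0 else np.inf)      # D(i-1, j-1)
            D[i, j] = best + np.linalg.norm(X[i] - Y[j])
    return D[N - 1, M - 1]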

55 4.2 Word spotting 49 Spatial pyramid Bag-of-visual-words Query image / -patch Topic vector Matching w1 w2 w3 w4 w5 w6 Document database (for each document image) Spatial pyramid Bag-of-visual-words Result (Database images) Image Overlapping patches Topic vector Image 1 Top patches... Image n Top patches w1 w2 w3 w4 w5 w6 Figure 24: Schematic illustration of bag-of-features word spotting. On the left side the bag-of-features creation for image patches in the document database and for the query is shown. A topic space representation is used for spotting similarities. On the right the final result is depicted: For each image of the database the most similar patches are retrieved. from the heuristic nature of segmentation methods. In most cases they will work but extreme cases cannot be handled sufficiently. The word spotting method described in [RATL11] is based on bag-offeatures. For that reason the methods and their results will be also of interest for bag-of-features in handwriting recognition. The following discussion will focus on the details of the approach. For an introduction to bag-of-features image representations refer to Chapter 3. Also compare Section due to analogies to image retrieval. Bag-of-features word spotting works on different datasets like historic handwritten documents or English and Persian typewritten documents. Due to the use of local image descriptors, no binarization is necessary and other normalizations like slant or baseline corrections can be omitted as well. The process is divided into two steps. First a patch-based bag-offeatures representation must be created for all images of the document database. Representations are patch-based to avoid an initial segmentation of document images. Their size has to approximately match the word size. This way it is likely that each word will be entirely part of at least one patch. Afterwards, similarities to a query image patch can be computed in the retrieval stage. The whole process is depicted in Figure 24. The creation of bag-of-features representations for image patches consists of the following steps: bag-of-features word spotting patch-based representation 1. SIFT descriptors are computed in a dense regular grid. At each grid point three differently scaled descriptors are calculated. Descriptors with a low gradient magnitude are directly discarded in order to concentrate on the pen-stroke. dense interest point sampling

56 50 4 character and word recognition spatial pyramid representation latent semantic indexing word image retrieval 2. Each descriptor is assigned to a visual word from the visual vocabulary. This has to be created beforehand. The bag-of-features statistic is then obtained according to a spatial pyramid. In this pyramid the image is divided into differently scaled regions (also compare Section 3.5.2). Bag-of-features representations from each region are concatenated to a single high dimensional vector. 3. For each document in the database a transformation is estimated in oder to obtain a topic space representation. This achieves the same as probabilistic latent semantic analysis that was introduced in Section Here a different method is used that relies on singular value decomposition instead of a stochastic model. For further details refer to [RATL11]. For spotting words in the database, the topic vector from the query patch is matched against all topic vectors from document image patches. Over all documents the most similar patches are retrieved. This way not only a particular document, but also the word locations in the document are made available to the user. 4.3 conclusion Applications in character recognition and word spotting show that local image features have successfully been applied on images of handwritten script. Furthermore, also a bag-of-features application exits. Nevertheless, experiments in part-based character recognition only work on a small number of categories. Thus, for an application in offline handwriting recognition these methods alone are not sufficient. Word spotting methods are able to handle greater variability but no exact transcription is performed. Experiments with learned bag-offeatures representations will show if the script independence of the presented bag-of-features word spotting approach is applicable to handwriting recognition.

57 B A G - O F - F E AT U R E S H A N D W R I T I N G R E C O G N I T I O N 5 Bag-of-features handwriting recognition integrates Hidden Markov Models (Section 2.3) with bag-of-features representations (Section 3.1). These replace the geometrical features presented in Section 2.3. Due to the estimation of the visual vocabulary, bag-of-features representations can be considered learned. This way they are able to automatically adapt to different scripts and other characteristics in handwriting. In [RATL11] different vocabularies were created for three different scripts including handwritten and typewritten documents. Geometrical features, in contrast, are hand-crafted and designed with expert knowledge. This means they cannot be automatically adapted to different scripts but have to be re-designed manually. Typewritten and handwritten scripts are a good examples in this regard because features in typewritten script are based on connected pixel-components (cf. [RATL11]) where features in handwritten script are based on column-wise pixel-representations (see Section 2.3). A handwriting recognition system consists of many different steps (see Section 2.1). Each of them is important but it is essential to use discriminative and characteristic feature representations. Applications of bag-of-features to images of handwriting script were therefore investigated first. For this purpose the recognition of complete isolated word images was considered. Based on results from these preliminary experiments, bag-of-features representations were integrated in a segmentation-free HMM-based handwriting recognition system. In Section 5.1 different methods used in this system are described in detail. Please note that many decisions were motivated by the previously preformed holistic experiments. For that reason Section 5.2 will briefly review the method used for holistic word recognition. learned features vs hand-crafted features bag-of-features investigations 5.1 context-based method The context-based method presented in this section uses a Hidden Markov Model for handwriting recognition. For learning feature representations a bag-of-features approach will be integrated. Novelties are the serialization of holistic bag-of-features representations and the direct usage of bag-of-features probabilities in the HMM. Bag-offeatures represent complete images thus a more local description is necessary. With respect to the HMM integration, bags-of-features can either be interpreted as feature vectors to train a Gaussian mixture model or the Gaussian mixtures can be omitted to directly learn feature probabilities with respect to the HMM states. In Section an bag-of-features HMM integration 51

58 52 5 bag-of-features handwriting recognition overview will be given. In the following sections the different steps will be discussed in detail Overview architecture of the bag-of-features handwriting recognition system Figure 25 shows an overview of the bag-of-features integration in an HMM-based handwriting recognition system. The process is divided into the six depicted steps: 1. Feature detection, 2. feature description, 3. creation of the visual vocabulary, 4. descriptor quantization with respect to the visual vocabulary, 5. sequential creation of bag-of-features representations and The process starts with binary text-line images that have been normalized according to variations found in the script considered. In the feature detection, Harris Corner points are calculated for a dense in- terest point representation of the pen-stroke. Descriptors are extracted at all detected interest points in the feature description step. Both, feature points and descriptors are visualized in the example text-lines. The visual vocabulary is only created in the learning stage. Descriptors extracted from a training dataset are clustered and the vocabulary is given by the cluster representatives. When clustering SIFT descrip- tors, visual words consist of 16 histograms of orientations. In Figure 25 a few visual words indicate the vocabulary. In the descriptor quantization each descriptor is represented by the most similar instance from the visual vocabulary. In Figure 25 differently colored points show the associations to visual words. Note the similar color schemes for visually similar characters. For example, descenders of g and y characters are predominantly colored in green and red. Also circular character structures have similar color patterns. In order to create an observation sequence for the HMM, a sequence of bag-of-features representations is extracted. Figure 25 shows the sliding window that is used for obtaining a weighted histogram of visual word occurrences. Note that the depicted window is wider than in practical applications for a better visualization. Finally the HMM can be used for handwriting recognition. Figure 25 shows states, transitions and indicates the generation of observations. feature detection feature description visual vocabulary descriptor quantization sliding bag-of-features train or decode an HMM 6. the usage of an HMM for recognition Features As discussed in Chapter 3, for bag-of-features image representations local interest point descriptors are most commonly used. Depending

59 5.1 Context-based method 53 Feature detection Feature description Learning Visual vocabulary Descriptor quantization Sliding bag-of-features Bag-of-visual-words Bag-of-visual-words Bag-of-visual-words w1w2w3w4w5w6 w1w2w3w4w5w6 w1w2w3w4w5w6 Train or decode HMM Figure 25: Bag-of-features integration in an HMM-based handwriting recognition system. Text-line image taken from [MB02].

60 54 5 bag-of-features handwriting recognition Figure 26: Example for feature detection in word images. The image on top shows Harris Corner points. Beneath the application of the SIFT detector is demonstrated. Text-line image taken from [MB02]. feature detection for images of handwritten script SIFT interest point detection for images of handwritten script on the task, interest point detection can be omitted and replaced by a dense interest point grid (cf. [OD11]). This also seems suitable for images of handwritten script because all areas containing pen-strokes can be considered equally important. Completely white areas on the other hand are not interesting at all. In [RATL11] a filter is applied on the dense interest point grid for that reason. If a descriptor has a low gradient magnitude, it is directly discarded. Here, the Harris Corners feature detector (Section 3.2.1) is used. It achieves a similar objective. Due to the curvatures found in cursive script a dense interest point coverage of the pen-stroke is obtained. Furthermore, it is very important to note that for bag-of-features representations a sufficient number of features is necessary. Otherwise no meaningful statistic results. Apart from dense interest point grids also the SIFT detector is used in bag-of-features related applications (cf. [OD11]). It will not be considered here for mainly two reasons: 1. In binary images of handwritten script usually only an insufficient number of interest points is detected (compare Figure 26). 2. Interest points are detected in a difference-of-gaussian scale space. The different scales represented this way will hardly be found in different word images. Thus, the scale invariance does not normalize size variations found in handwritten script. It is therefore also not suitable to base descriptor extraction on images from the Gaussian scale space. SIFT descriptors at Harris interest point locations These arguments will in general not apply for richly textured images found, for example, in scene categorization or image retrieval. Also scale invariance is a desirable property in these cases. Nevertheless, in [SZ03] some bad retrieval results originated from underrepresented low-contrast image regions. Figure 26 shows Harris Corners (top) and SIFT (bottom) feature points in comparison. SIFT interest points are sparsely distributed and also lie off-script. Descriptors are computed from the local interest point s neighborhoods. As in other applications (compare Section 3.5 and Chapter 4) the 128-dimensional SIFT descriptor (Section 3.2.2) is used. Because Harris Corner points are only defined by their spatial image location,

61 5.1 Context-based method 55 Figure 27: Examples for different feature description techniques. In the lower two images SIFT descriptors are visualized. The two examples depict different approaches for setting descriptor orientations. The topmost image shows SIFT interest points in terms of orientation and scale. Their mean scale is used to estimate the descriptor size. Note that only a few interest points and descriptors are shown for a better visualization. Text-line image taken from [MB02]. no scale, scale space representation and principle interest point orientation is given for the SIFT descriptor computation. Instead, descriptors are directly obtained from the image. Scale and orientation parameters have to be determined beforehand. Note that the scale directly corresponds to the spatial size of the descriptor in the image. Therefore, a single scale is calculated for all feature points. It is given by an estimate of the mean SIFT interest point scale. For that reason SIFT interest points are computed with the SIFT detector first. Figure 27 (top) shows a few SIFT feature points and their orientation and scale. The size used for descriptors at Harris Corners points (Figure 27 middle and bottom) corresponds to the mean of all SIFT feature point scales. The estimated size corresponds to the image coverage of a scale invariant SIFT descriptor computed at that scale level. For example, if the majority of SIFT interest points is detected at coarse scales, the descriptor will cover a bigger area. This way the descriptor size adapts to different pen-stroke widths in images of handwritten script (also compare Figure 29). In order to control the size with respect to different datasets, the estimated scale is multiplied with a descriptor magnification parameter. Additionally, it is possible to extract more than one descriptor at a feature point. In [RATL11] at all feature locations three descriptors with pixel-sizes 5 5, and are computed. For the approach discussed in this section, multi-scale descriptors are not of major importance. Further details can be found in Section Finally, the descriptor orientation must be determined. Four different methods will be discussed: 1. Fixed orientation, scale estimation for all Harris Corners points multiple descriptors at a feature point orientation estimation for each Harris Corners point 2. orientation invariance,

62 56 5 bag-of-features handwriting recognition 3. restricted orientation invariance and 4. orientation according to a local slant estimation. An example for fixed orientations can be found in the feature descrip- tion step in Figure 25. This configuration has already been found effective in [UL10]. For orientation invariance, a principle orientation must be determined for the Harris Corner points first. This is achieved analogously to the SIFT principle-orientation estimation. An example can be found in Figure 27 (middle). All descriptors have individual orientations. Restricting orientation invariance is achieved by a slight modification. Instead of using 360 degrees, all orientations in [180, 360] are mapped to the interval [0, 180]. This way the upright orientation of script is taken into account. Finally, the local slant estimation is directly based on these 181 degree orientations. The orientations of interest points in an image column are estimated based on all interest point orientations in the surrounding neighborhood. In order to select these, a Hamming window is slid over the image. This way a weighted histogram of orientations is obtained. The Hamming window is frequently used in signal processing e.g. in the spectral analysis of audio signals. Here, it only serves the purpose to weigh interest point orientations with respect to the distance to the considered image column. The Hamming window is defined in Equation 5.1 (cf. [Har78]). N denotes the window s length and 0 n N 1. fixed orientation orientation invariance restricted orientation invariance orientation according to a local slant estimation [ November 6, 2011 at 18:29 ] [ November 6, 2011 at 11:05 ] descriptor subspace representation ( ) 2πn w(n) = cos N 1 (5.1) As mentioned before, the orientation of script in general is upright. Therefore, a further weighting scheme is applied to the orientation histograms. The local slant is mainly defined by vertical and only insignificantly by horizontal structures. Let 0 and 180 degree correspond to orientations of horizontal structures. Then the histogram can be weighted with a Gaussian window of length N = 181 according to Equation 5.2 (cf. [Har78]). w(n) = e 1 2 ( ) n (N 1)/2 2 σ(n 1)/2 (5.2) The function is defined for 0 n N 1 and σ 0.5. The Gaussian window is used because of the standard deviation parameter σ. By modifying its value the weight of horizontal orientations can be changed. A Hamming window, in contrast, has a fixed shape. The final result of the local slant estimation is depicted in Figure 27 (bottom). The descriptor orientations follow the local slant of the script. After descriptor extraction, the feature calculation is concluded with decorrelation and optional dimensionality reduction. The respective PCA transformation (see Section 3.3) is estimated on descriptors from the training dataset.

63 5.1 Context-based method Visual vocabulary and descriptor quantization The visual vocabulary is estimated in the learning stage. Therefore, descriptors must be extracted from a training dataset. For the considered datasets over five million 128-dimensional descriptors must be clustered with respect to between 1500 and 2500 visual words. For that reason only McQueen s k-means algorithm is used. In contrast to Lloyd s clustering algorithm only one iteration over the dataset is necessary (see Section 3.4). The size of the vocabulary is chosen empirically and will be further investigated in the evaluation (Section 6.3). During the clustering process the Euclidean distance is used. For the subsequent quantization, a Gaussian mixture model is estimated based on clustering results (see Section 3.4). Sample-cluster affiliations can be phrased in terms of probabilities now. Two different quantization variants can be considered: 1. Hard quantization: A sample is associated with the cluster to which it has maximum affiliation probability. McQueen s k-means vs Lloyd s algorithm descriptor quantization techniques 2. Soft quantization: A sample is associated with multiple clusters. For each sample all cluster probabilities are saved and will be used in the bag-of-features statistic (see Section 5.1.4). Covariances estimated in the Gaussian mixture model are diagonal. The decorrelation of SIFT descriptors should improve the statistical properties of the data in order to better fit with the statistical model. Note that the example for a visual vocabulary in Figure 25 is directly based on SIFT descriptors. Visual words derived from clustering decorrelated descriptors cannot be interpreted in terms of orientation histograms anymore. decorrelation for better mixture model integration Sliding bag-of-features In previously discussed bag-of-features applications the bag-of-features statistic was always obtained from all visual word occurrences within a given image. An HMM, in contrast, needs a sequence of feature representations. As usual in HMM-based handwriting recognition, a sliding window will be applied for that purpose. The major difference is that the window will be used to create a weighted histogram of visual word occurrences. The sliding window thus accomplishes a localization of the bag-of-features representation. For obtaining the weighted histogram, two aspects must be taken into account: The weight of a particular quantized descriptor with respect to localizing bag-of-features representations 1. its visual word and 2. its distance to the window center.

64 58 5 bag-of-features handwriting recognition weighting visual word occurrences term weighting whitespace handling The weight with respect to the visual word is one in case of hard quantization and the respective probability in case of soft quantization (see Section 5.1.3). For the weight regarding the distance to the window center, the window type and size are essential. In case of a rectangular box window that is also used in the geometric feature extraction (see Section 2.3) the weight is either one or zero depending whether a visual word occurs within the window boundaries or not. With the Hamming window, in contrast, a smooth weighting is applied. It has already been considered for the creation of smooth orientation histograms in Section (see Equation 5.1). After accumulating the weighted visual word occurrences, the histogram is normalized to obtain frequencies. Optionally, the inversedocument-frequency weighting scheme can be applied to reduce the influence of very frequent visual words. The IDF scheme is estimated based on training data. Further details on bag-of-features weighting schemes can be found in Section Finally, a special case must be considered. If no visual words occur under the window, an empty histogram is obtained. In order to model these whitespaces explicitly, a pseudo visual word is introduced. Its frequency is one if no other visual words occur and zero otherwise. This means the number of visual words is implicitly increased by one Hidden Markov Models For using Hidden Markov Models with bag-of-features representations two possibilities will be discussed: 1. Semi-continuous HMM. Model parameters are estimated by interpreting bag-of-features representations as feature vectors. 2. Bag-of-features HMM. For each state, probabilities of the respective visual words are estimated. The following sections will address these variants. Semi-continuous Hidden Markov Model estimating a mixture model from bag-of-features representations The first approach is very similar to the semi-continuous HMM presented in Section 2.4. The major difference is the feature representation. Geometrical features and their derivatives (see Section 2.3) form an 18-dimensional feature vector. When estimating the Gaussian mixture model, each mean vector is 18-dimensional and the diagonal covariance matrices will be dimensional. In the context-based method the visual vocabulary consists of more than 1500 visual words. If bag-of-features representations would directly be used for estimating the statistical model, the dimension of mean vectors and covariance matrices would increase accordingly. As already mentioned in Sections 2.4 and 3.4, the number of training samples necessary to robustly

65 5.1 Context-based method 59 Visual-word probabilities Visual-word probabilities... w1w2w3w4w5w6 w1w2w3w4w5w6 Figure 28: Schematic illustration of an HMM directly observing visual word distributions. estimate the statistical model is directly correlated with the number of its free parameters. For the usage of bag-of-features representations in semi-continuous mixture models this means that the feature vector dimensionality must be reduced to the same order as the dimensionality of geometrical features. This is accomplished by PCA (see Section 3.3). A major problem in the estimation of the transformation is the number of samples. From each sliding window position in all images of the training dataset, a bag-of-features representation is created. For the considered training datasets more than four million representations are obtained. Each has at least 1500 dimensions. In order to cope with the amount of data, only every second sample is used in the estimation. In addition to these dimensionality reduced feature vectors their first derivatives can be considered. In the first of two variants the derivatives are concatenated with the original vector thus doubling its dimension. The second variant builds upon the first: Another PCA is estimated to reduce the previously doubled dimensionality to its half. reducing the bag-of-features dimensionality using first derivatives Bag-of-features Hidden Markov Model Instead of estimating a semi-continuous Gaussian mixture model, visual word probabilities can directly be estimated for each state. Consequently, the sequence of observations consists of visual word probabilities. Figure 28 illustrates this concept schematically. This is different to the former visual word frequencies because probabilities must accumulate to one. Probabilities can be obtained from frequencies by simple normalization. For that reason the whitespace visual word introduced in Section is important. In Equation 2.10 the observation densities in a semi-continuous HMM are defined according to a Gaussian mixture. Here, the Gaussian components can be replaced by the bag-of-features model. Let f denote visual word probabilities in the observation sequence and V the number of items in interpreting bag-of-features representations as probability distributions

66 60 5 bag-of-features handwriting recognition the visual vocabulary thus the dimension of f. Then the probability of observing a distribution of visual words in a particular state is given in Equation 5.3. pseudo discrete HMM b j ( f ) = V c jk f k (5.3) k=1 The only parameters to be estimated in the HMM training are the coefficients c jk that describe a mixture of visual words per state. In a discrete HMM the probabilities for observing a symbol in the active state are estimated (cf. [Fin08]). Here, the symbols correspond to the visual words. The difference to a discrete HMM is that not a specific visual word is observable at a point in time but a distribution of visual words. The HMM can therefore be considered pseudo discrete. 5.2 holistic method word image categorization The holistic recognition of word images using bag-of-features representations will be considered in this section. The word image categorization will be based on a k-nearest-neighbor (k-nn) or alternatively a Bayes classifier. The k-nn classifier uses a distance measure for bag-of-features representations. The categorization will be based on most similar instances from the annotated training dataset. Classification techniques are related to character recognition (Section 4.1) and the bag-of-features distance measure is related to approaches in word spotting (Section 4.2). In general, the method is not intended to produce state-of-theart recognition results. It can rather be considered an experimental environment to investigate the characteristics of bag-of-features representations of word images. In the remainder of this section first bag-of-features representations for images of handwritten words will be discussed (Section 5.2.1). Afterwards, two classification approaches are presented. Many aspects are similar to the methods used in the HMM-based approach. In the following sections mainly the differences will be outlined Features feature detection Input to the method are segmented word images that must be represented in terms of bag-of-features. The Harris Corner detector is used to obtain a dense interest point representation of the pen-stroke (also compare Figure 8). SIFT descriptors are extracted at all Harris Corner points. For the representation of complete images, experiments have shown that extracting multiple descriptors of increasing scales at an interest point location produces best results. This is also consistent with the word spotting approach in [RATL11]. As in the context-based method (Section 5.1.2), descriptors are not computed in

67 5.2 Holistic method 61 Figure 29: Examples for feature description in holistic word recognition. At each interest point location three differently sized descriptors are extracted. In comparison, the two words differ largely in the width of their pen-stroke. The size estimation is able to adapt descriptors accordingly. Word images taken from [PMM + 02]. a Gaussian-scale-space representation but directly in the image. Word images scale space representations do not represent scaled instances of a word. Here, different scales therefore refer to descriptors covering different spatial resolutions. For obtaining the different scales, an initial scale S 0 for the smallest descriptor must be determined. The estimation is performed analogously to the method described in Section Following scales are recursively defined with i 1 according to Equation 5.4. multiple descriptor scales S i = S 1.5 i 1 (5.4) The exponent 1.5 has been empirically evaluated in preliminary experiments. Figure 29 shows a few descriptors extracted in two different word images. Three different scales are used at each interest point. Also note that the pen-strokes differ largely and the initial descriptor sizes are adapted accordingly Categorization For categorization, images of a training and test dataset must be given in terms of their bag-of-features representations. The visual vocabulary is estimated from the training dataset. For further details refer to Chapter 3 and Section 5.1. In the remainder of this section the naïve Bayes categorization and the k-nearest-neighbor categorization will be described briefly. The idea of a naïve Bayes classifier was introduced in Section For a given set of categories, the visual word probabilities with respect to each category are estimated in a prior training. A formerly unknown image is then classified into the category with maximum a-posteriori probability with respect to the bag-of-features representation of the given image. For the holistic recognition of word images, the set of categories is defined by the lexicon of words that will be recognized. naïve Bayes categorization

Bag-of-Features Representations for Offline Handwriting Recognition Applied to Arabic Script

Bag-of-Features Representations for Offline Handwriting Recognition Applied to Arabic Script 2012 International Conference on Frontiers in Handwriting Recognition Bag-of-Features Representations for Offline Handwriting Recognition Applied to Arabic Script Leonard Rothacker, Szilárd Vajda, Gernot

More information

HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation

HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation 009 10th International Conference on Document Analysis and Recognition HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation Yaregal Assabie and Josef Bigun School of Information Science,

More information

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models Gleidson Pegoretti da Silva, Masaki Nakagawa Department of Computer and Information Sciences Tokyo University

More information

Invariant Recognition of Hand-Drawn Pictograms Using HMMs with a Rotating Feature Extraction

Invariant Recognition of Hand-Drawn Pictograms Using HMMs with a Rotating Feature Extraction Invariant Recognition of Hand-Drawn Pictograms Using HMMs with a Rotating Feature Extraction Stefan Müller, Gerhard Rigoll, Andreas Kosmala and Denis Mazurenok Department of Computer Science, Faculty of

More information

Outline 7/2/201011/6/

Outline 7/2/201011/6/ Outline Pattern recognition in computer vision Background on the development of SIFT SIFT algorithm and some of its variations Computational considerations (SURF) Potential improvement Summary 01 2 Pattern

More information

Computer vision: models, learning and inference. Chapter 13 Image preprocessing and feature extraction

Computer vision: models, learning and inference. Chapter 13 Image preprocessing and feature extraction Computer vision: models, learning and inference Chapter 13 Image preprocessing and feature extraction Preprocessing The goal of pre-processing is to try to reduce unwanted variation in image due to lighting,

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

Latest development in image feature representation and extraction

Latest development in image feature representation and extraction International Journal of Advanced Research and Development ISSN: 2455-4030, Impact Factor: RJIF 5.24 www.advancedjournal.com Volume 2; Issue 1; January 2017; Page No. 05-09 Latest development in image

More information

Short Survey on Static Hand Gesture Recognition

Short Survey on Static Hand Gesture Recognition Short Survey on Static Hand Gesture Recognition Huu-Hung Huynh University of Science and Technology The University of Danang, Vietnam Duc-Hoang Vo University of Science and Technology The University of

More information

SIFT - scale-invariant feature transform Konrad Schindler

SIFT - scale-invariant feature transform Konrad Schindler SIFT - scale-invariant feature transform Konrad Schindler Institute of Geodesy and Photogrammetry Invariant interest points Goal match points between images with very different scale, orientation, projective

More information

Mono-font Cursive Arabic Text Recognition Using Speech Recognition System

Mono-font Cursive Arabic Text Recognition Using Speech Recognition System Mono-font Cursive Arabic Text Recognition Using Speech Recognition System M.S. Khorsheed Computer & Electronics Research Institute, King AbdulAziz City for Science and Technology (KACST) PO Box 6086, Riyadh

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

Introduction to SLAM Part II. Paul Robertson

Introduction to SLAM Part II. Paul Robertson Introduction to SLAM Part II Paul Robertson Localization Review Tracking, Global Localization, Kidnapping Problem. Kalman Filter Quadratic Linear (unless EKF) SLAM Loop closing Scaling: Partition space

More information

Part-Based Skew Estimation for Mathematical Expressions

Part-Based Skew Estimation for Mathematical Expressions Soma Shiraishi, Yaokai Feng, and Seiichi Uchida shiraishi@human.ait.kyushu-u.ac.jp {fengyk,uchida}@ait.kyushu-u.ac.jp Abstract We propose a novel method for the skew estimation on text images containing

More information

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT SIFT: Scale Invariant Feature Transform; transform image

More information

A Content Based Image Retrieval System Based on Color Features

A Content Based Image Retrieval System Based on Color Features A Content Based Image Retrieval System Based on Features Irena Valova, University of Rousse Angel Kanchev, Department of Computer Systems and Technologies, Rousse, Bulgaria, Irena@ecs.ru.acad.bg Boris

More information

Image Processing. Image Features

Image Processing. Image Features Image Processing Image Features Preliminaries 2 What are Image Features? Anything. What they are used for? Some statements about image fragments (patches) recognition Search for similar patches matching

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1.1 Introduction Pattern recognition is a set of mathematical, statistical and heuristic techniques used in executing `man-like' tasks on computers. Pattern recognition plays an

More information

IDIAP. Martigny - Valais - Suisse IDIAP

IDIAP. Martigny - Valais - Suisse IDIAP R E S E A R C H R E P O R T IDIAP Martigny - Valais - Suisse Off-Line Cursive Script Recognition Based on Continuous Density HMM Alessandro Vinciarelli a IDIAP RR 99-25 Juergen Luettin a IDIAP December

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

CS 231A Computer Vision (Fall 2012) Problem Set 3

CS 231A Computer Vision (Fall 2012) Problem Set 3 CS 231A Computer Vision (Fall 2012) Problem Set 3 Due: Nov. 13 th, 2012 (2:15pm) 1 Probabilistic Recursion for Tracking (20 points) In this problem you will derive a method for tracking a point of interest

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network

Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network Utkarsh Dwivedi 1, Pranjal Rajput 2, Manish Kumar Sharma 3 1UG Scholar, Dept. of CSE, GCET, Greater Noida,

More information

Scale Invariant Feature Transform

Scale Invariant Feature Transform Why do we care about matching features? Scale Invariant Feature Transform Camera calibration Stereo Tracking/SFM Image moiaicing Object/activity Recognition Objection representation and recognition Automatic

More information

CIRCULAR MOIRÉ PATTERNS IN 3D COMPUTER VISION APPLICATIONS

CIRCULAR MOIRÉ PATTERNS IN 3D COMPUTER VISION APPLICATIONS CIRCULAR MOIRÉ PATTERNS IN 3D COMPUTER VISION APPLICATIONS Setiawan Hadi Mathematics Department, Universitas Padjadjaran e-mail : shadi@unpad.ac.id Abstract Geometric patterns generated by superimposing

More information

Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes

Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes 2009 10th International Conference on Document Analysis and Recognition Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes Alireza Alaei

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Problem definition Image acquisition Image segmentation Connected component analysis. Machine vision systems - 1

Problem definition Image acquisition Image segmentation Connected component analysis. Machine vision systems - 1 Machine vision systems Problem definition Image acquisition Image segmentation Connected component analysis Machine vision systems - 1 Problem definition Design a vision system to see a flat world Page

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

ROBUST OBJECT TRACKING BY SIMULTANEOUS GENERATION OF AN OBJECT MODEL

ROBUST OBJECT TRACKING BY SIMULTANEOUS GENERATION OF AN OBJECT MODEL ROBUST OBJECT TRACKING BY SIMULTANEOUS GENERATION OF AN OBJECT MODEL Maria Sagrebin, Daniel Caparròs Lorca, Daniel Stroh, Josef Pauli Fakultät für Ingenieurwissenschaften Abteilung für Informatik und Angewandte

More information

Hidden Markov Model for Sequential Data

Hidden Markov Model for Sequential Data Hidden Markov Model for Sequential Data Dr.-Ing. Michelle Karg mekarg@uwaterloo.ca Electrical and Computer Engineering Cheriton School of Computer Science Sequential Data Measurement of time series: Example:

More information

Object of interest discovery in video sequences

Object of interest discovery in video sequences Object of interest discovery in video sequences A Design Project Report Presented to Engineering Division of the Graduate School Of Cornell University In Partial Fulfillment of the Requirements for the

More information

Multimedia Databases. Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig

Multimedia Databases. Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig Multimedia Databases Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de Previous Lecture Audio Retrieval - Query by Humming

More information

Multimedia Databases. 9 Video Retrieval. 9.1 Hidden Markov Model. 9.1 Hidden Markov Model. 9.1 Evaluation. 9.1 HMM Example 12/18/2009

Multimedia Databases. 9 Video Retrieval. 9.1 Hidden Markov Model. 9.1 Hidden Markov Model. 9.1 Evaluation. 9.1 HMM Example 12/18/2009 9 Video Retrieval Multimedia Databases 9 Video Retrieval 9.1 Hidden Markov Models (continued from last lecture) 9.2 Introduction into Video Retrieval Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme

More information

CS443: Digital Imaging and Multimedia Binary Image Analysis. Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University

CS443: Digital Imaging and Multimedia Binary Image Analysis. Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University CS443: Digital Imaging and Multimedia Binary Image Analysis Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University Outlines A Simple Machine Vision System Image segmentation by thresholding

More information

Chapter 3 Image Registration. Chapter 3 Image Registration

Chapter 3 Image Registration. Chapter 3 Image Registration Chapter 3 Image Registration Distributed Algorithms for Introduction (1) Definition: Image Registration Input: 2 images of the same scene but taken from different perspectives Goal: Identify transformation

More information

Local Features: Detection, Description & Matching

Local Features: Detection, Description & Matching Local Features: Detection, Description & Matching Lecture 08 Computer Vision Material Citations Dr George Stockman Professor Emeritus, Michigan State University Dr David Lowe Professor, University of British

More information

CS 231A Computer Vision (Fall 2011) Problem Set 4

CS 231A Computer Vision (Fall 2011) Problem Set 4 CS 231A Computer Vision (Fall 2011) Problem Set 4 Due: Nov. 30 th, 2011 (9:30am) 1 Part-based models for Object Recognition (50 points) One approach to object recognition is to use a deformable part-based

More information

Motion Estimation and Optical Flow Tracking

Motion Estimation and Optical Flow Tracking Image Matching Image Retrieval Object Recognition Motion Estimation and Optical Flow Tracking Example: Mosiacing (Panorama) M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003 Example 3D Reconstruction

More information

Multiple-Choice Questionnaire Group C

Multiple-Choice Questionnaire Group C Family name: Vision and Machine-Learning Given name: 1/28/2011 Multiple-Choice naire Group C No documents authorized. There can be several right answers to a question. Marking-scheme: 2 points if all right

More information

Scale Invariant Feature Transform

Scale Invariant Feature Transform Scale Invariant Feature Transform Why do we care about matching features? Camera calibration Stereo Tracking/SFM Image moiaicing Object/activity Recognition Objection representation and recognition Image

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Corner Detection. GV12/3072 Image Processing.

Corner Detection. GV12/3072 Image Processing. Corner Detection 1 Last Week 2 Outline Corners and point features Moravec operator Image structure tensor Harris corner detector Sub-pixel accuracy SUSAN FAST Example descriptor: SIFT 3 Point Features

More information

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS Cognitive Robotics Original: David G. Lowe, 004 Summary: Coen van Leeuwen, s1460919 Abstract: This article presents a method to extract

More information

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm Group 1: Mina A. Makar Stanford University mamakar@stanford.edu Abstract In this report, we investigate the application of the Scale-Invariant

More information

Towards Automatic Video-based Whiteboard Reading

Towards Automatic Video-based Whiteboard Reading Towards Automatic Video-based Whiteboard Reading Markus Wienecke Gernot A. Fink Gerhard Sagerer Bielefeld University, Faculty of Technology 33594 Bielefeld, Germany E-mail: mwieneck@techfak.uni-bielefeld.de

More information

A Keypoint Descriptor Inspired by Retinal Computation

A Keypoint Descriptor Inspired by Retinal Computation A Keypoint Descriptor Inspired by Retinal Computation Bongsoo Suh, Sungjoon Choi, Han Lee Stanford University {bssuh,sungjoonchoi,hanlee}@stanford.edu Abstract. The main goal of our project is to implement

More information

Handwritten Gurumukhi Character Recognition by using Recurrent Neural Network

Handwritten Gurumukhi Character Recognition by using Recurrent Neural Network 139 Handwritten Gurumukhi Character Recognition by using Recurrent Neural Network Harmit Kaur 1, Simpel Rani 2 1 M. Tech. Research Scholar (Department of Computer Science & Engineering), Yadavindra College

More information

Scene Text Detection Using Machine Learning Classifiers

Scene Text Detection Using Machine Learning Classifiers 601 Scene Text Detection Using Machine Learning Classifiers Nafla C.N. 1, Sneha K. 2, Divya K.P. 3 1 (Department of CSE, RCET, Akkikkvu, Thrissur) 2 (Department of CSE, RCET, Akkikkvu, Thrissur) 3 (Department

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Estimating the wavelength composition of scene illumination from image data is an

Estimating the wavelength composition of scene illumination from image data is an Chapter 3 The Principle and Improvement for AWB in DSC 3.1 Introduction Estimating the wavelength composition of scene illumination from image data is an important topics in color engineering. Solutions

More information

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points] CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, 2015. 11:59pm, PDF to Canvas [100 points] Instructions. Please write up your responses to the following problems clearly and concisely.

More information

Texture. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors

Texture. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors Texture The most fundamental question is: How can we measure texture, i.e., how can we quantitatively distinguish between different textures? Of course it is not enough to look at the intensity of individual

More information

Wikipedia - Mysid

Wikipedia - Mysid Wikipedia - Mysid Erik Brynjolfsson, MIT Filtering Edges Corners Feature points Also called interest points, key points, etc. Often described as local features. Szeliski 4.1 Slides from Rick Szeliski,

More information

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1 Clustering images Set of Images [Goldberger et al.] Carlos Guestrin 2005-2014

More information

Schedule for Rest of Semester

Schedule for Rest of Semester Schedule for Rest of Semester Date Lecture Topic 11/20 24 Texture 11/27 25 Review of Statistics & Linear Algebra, Eigenvectors 11/29 26 Eigenvector expansions, Pattern Recognition 12/4 27 Cameras & calibration

More information

Implementing the Scale Invariant Feature Transform(SIFT) Method

Implementing the Scale Invariant Feature Transform(SIFT) Method Implementing the Scale Invariant Feature Transform(SIFT) Method YU MENG and Dr. Bernard Tiddeman(supervisor) Department of Computer Science University of St. Andrews yumeng@dcs.st-and.ac.uk Abstract The

More information

Indian Multi-Script Full Pin-code String Recognition for Postal Automation

Indian Multi-Script Full Pin-code String Recognition for Postal Automation 2009 10th International Conference on Document Analysis and Recognition Indian Multi-Script Full Pin-code String Recognition for Postal Automation U. Pal 1, R. K. Roy 1, K. Roy 2 and F. Kimura 3 1 Computer

More information

CS 543: Final Project Report Texture Classification using 2-D Noncausal HMMs

CS 543: Final Project Report Texture Classification using 2-D Noncausal HMMs CS 543: Final Project Report Texture Classification using 2-D Noncausal HMMs Felix Wang fywang2 John Wieting wieting2 Introduction We implement a texture classification algorithm using 2-D Noncausal Hidden

More information

10-701/15-781, Fall 2006, Final

10-701/15-781, Fall 2006, Final -7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly

More information

Slant Correction using Histograms

Slant Correction using Histograms Slant Correction using Histograms Frank de Zeeuw Bachelor s Thesis in Artificial Intelligence Supervised by Axel Brink & Tijn van der Zant July 12, 2006 Abstract Slant is one of the characteristics that

More information

Lecture 21 : A Hybrid: Deep Learning and Graphical Models

Lecture 21 : A Hybrid: Deep Learning and Graphical Models 10-708: Probabilistic Graphical Models, Spring 2018 Lecture 21 : A Hybrid: Deep Learning and Graphical Models Lecturer: Kayhan Batmanghelich Scribes: Paul Liang, Anirudha Rayasam 1 Introduction and Motivation

More information

Handwritten Text Recognition

Handwritten Text Recognition Handwritten Text Recognition M.J. Castro-Bleda, Joan Pasto Universidad Politécnica de Valencia Spain Zaragoza, March 2012 Text recognition () TRABHCI Zaragoza, March 2012 1 / 1 The problem: Handwriting

More information

Topological Mapping. Discrete Bayes Filter

Topological Mapping. Discrete Bayes Filter Topological Mapping Discrete Bayes Filter Vision Based Localization Given a image(s) acquired by moving camera determine the robot s location and pose? Towards localization without odometry What can be

More information

Image features. Image Features
