Syntax-directed content analysis of videotext: application to a map detection and recognition system


Hrishikesh Aradhye*, James Herson, and Gregory Myers
SRI International, 333 Ravenswood Avenue, Menlo Park, CA

ABSTRACT

Video is an increasingly important and ever-growing source of information to the intelligence and homeland defense analyst. A capability to automatically identify the contents of video imagery would enable the analyst to index relevant foreign and domestic news videos in a convenient and meaningful way. To this end, the proposed system aims to help determine the geographic focus of a news story directly from video imagery by detecting and geographically localizing political maps from news broadcasts, using the results of videotext recognition in lieu of a computationally expensive, scale-independent shape recognizer. Our novel method for the geographic localization of a map is based on the premise that the relative placement of text superimposed on a map roughly corresponds to the geographic coordinates of the locations the text represents. Our scheme extracts and recognizes videotext, and iteratively identifies the geographic area, while allowing for OCR errors and artistic freedom. The fast and reliable recognition of such maps by our system may provide valuable context and supporting evidence for other sources, such as speech recognition transcripts. The concepts of syntax-directed content analysis of videotext presented here can be extended to other content analysis systems.

Keywords: Video OCR, video content analysis, map detection and recognition, syntax-directed recognition and retrieval

1. INTRODUCTION

The volume of collected multimedia data of potential interest to the intelligence and homeland defense analyst is expanding at a tremendous rate. A capability to automatically identify the contents of video imagery would enable videos to be indexed in a convenient and meaningful way for later reference, and would enable actions such as automatic notification and dissemination to be triggered in real time by the contents of streaming video. Besides speech, closed captioning, and visual content, videotext (text superimposed on images and video frames) is an important source of semantic information in video streams of news broadcasts. The recognition of text superimposed on video frames yields useful information such as the identity of a speaker, his or her location, the topic under discussion, sports scores, product names, and associated shopping data, allowing for automated content description, search, event monitoring, and video program categorization. For instance, the proposed system can detect and geographically localize near-full-screen political maps from news broadcasts, such as the maps shown in Fig. 1, using image analyses based on videotext recognition in lieu of a rigorous, scale-independent generic shape recognizer. The recognition of such maps can help determine the geographic focus of a news story directly from video imagery, and may provide valuable context and supporting evidence for the speech recognition transcripts generated from the audio track of the story. The recognition of text is easier and faster than the recognition of objects in an arbitrarily complex scene, because text has been designed to be readable and has regular forms that humans can easily interpret. For these reasons, recent work has focused on the use of videotext extraction and recognition, instead of rigorous object recognition, for content analysis.
For instance, a recent video content retrieval system [1] learns to associate faces extracted from video frames with the recognized textual content of the superimposed caption, which presumably includes the names of the persons shown in the video. The system then uses videotext recognition results, along with a lexicon of names, to recognize occurrences of those persons' faces in news broadcasts. Analogously, most published videotext recognition work focuses on the textual content of videotext objects. Our work, however, is instead based on our contention that the syntactic aspects of videotext objects, such as their placement relative to the frame as well as to each other, in addition to their size, typeface, and color, may present the intelligence analyst with additional, as-yet-unexplored information. The current application is focused on the detection and recognition of political maps from news broadcasts.

* Hrishikesh.aradhye@sri.com

Document Recognition and Retrieval X, Tapas Kanungo, Elisa H. Barney Smith, Jianying Hu, Paul B. Kantor, Editors, Proceedings of SPIE-IS&T Electronic Imaging, SPIE Vol. 5010 (2003). © 2003 SPIE-IS&T.

However, similar principles can be extended to other video-based detection and recognition systems, such as those that recognize scores and statistics in sports broadcasts.

As in other work in video content analysis, such as the detection and recognition of human faces in video, we first use the unique image-domain features of political maps to detect the presence of a map in a given frame. Next, we attempt to pinpoint the geographical area covered by the map, such as Eastern Europe, South-East Asia, or the Middle East. To this end, it suffices to estimate the geographical coordinates of the center of the map and its magnification. Explicit shape recognition of the territorial lines on the map would be a difficult and computationally expensive task. Our method is instead based on the premise that the relative placement of text superimposed on a map roughly corresponds to the geographic locations the text represents. For instance, the map of North and South America in Fig. 1 displays the geo-text UNITED STATES above and to the left, and BRAZIL below and to the right, of the geo-text MEXICO. Since this relative placement is roughly consistent with the known geographic fact that the U.S. and Brazil lie to the north and southeast of Mexico, respectively, one may conclude from the coordinates of the geo-texts relative to the frame that the map in question is indeed of the Americas. Our key assumption, then, is that the names of geographical locations such as UNITED STATES, MEXICO, and BRAZIL would not appear on a video frame with a nearly bimodal color scheme (1) as isolated, unjustified words and (2) at geographically consistent distances and directions unless the frame were a map of the Americas. This approach is preferable to map shape recognition because of its simplicity, generality, and scale-invariant nature. The map localization is of course approximate, since the graphic artist has some freedom to arrange the text in a geographically consistent yet readable and uncluttered manner. However, we contend that an approximate geographic localization may be sufficient for the purposes of video content analysis. The following sections describe our approach to the detection and recognition of near-full-screen political maps in greater detail.

2. VIDEOTEXT EXTRACTION AND RECOGNITION

The earliest efforts in videotext extraction and recognition were applied to text captions in commercially produced video [2,3,4]. Several constraints make videotext extraction and recognition challenging: the low resolution of videotext, unconstrained font styles and sizes, poor separation of characters (often a result of compression and decoding), and complex, colorful, moving backgrounds. Methods of detecting and locating text try to take advantage of distinguishing characteristics of text such as consistency in alignment, orientation, stroke thickness, character height, spacing, and intensity or color. All approaches have two main steps: (1) apply filters or other processing to produce a high response in text areas, and (2) coalesce the high-response pixels into regions or individual text lines. One class of methods uses color clustering techniques [5] or binarization [6] to identify pixels belonging to text, but these methods are not sufficient by themselves to robustly distinguish a wide variety of text amid complex backgrounds.
Another class of approaches measures spatial frequency or texture with spatial variance [7], Gabor filtering [8], Gaussian filtering [9], or wavelets [4] to locate candidate regions of text. A third class of approaches [9,10,11], including our work, detects vertically oriented strokes or character edges and then links or clusters them by a set of rules that depend on the characteristics of the individual elements. Compared with texture-based filtering, coalescing in this approach can be more finely controlled so as to avoid including non-text image items close to or touching the text.

Our videotext recognition process, shown in Fig. 2, operates on individual frames extracted from a video sequence. It produces OCR results that are time tagged: each word in the OCR results corresponds to a single instance of text with a starting frame time (when it first appears) and an ending frame time (when it disappears from the image). The processing steps are arranged in a pipelined architecture that has a latency of up to several seconds. All of the processing is implemented in software in C++ and runs under Windows NT. In each processing cycle the following steps are executed:

1. Individual lines of text are located in the gray-scale image.
2. Each text line is binarized.
3. The OCR engine is applied to each of the binarized text lines.

We now describe each of these steps in detail.
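As a minimal illustration of the time-tagged output described above, the sketch below groups per-frame OCR words into text instances with start and end frame times. The `TimedWord` structure and the matching rule (an identical word observed in consecutive frames belongs to one instance) are our own simplifications, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class TimedWord:
    text: str          # recognized word
    start_time: float  # frame time when the word first appears (s)
    end_time: float    # frame time when the word disappears (s)

def aggregate(frames):
    """frames: list of (frame_time, set_of_words) in temporal order.
    Returns one TimedWord per contiguous run of an identical word."""
    active, done = {}, []
    for t, words in frames:
        for w in words:
            active.setdefault(w, TimedWord(w, t, t)).end_time = t
        for w in list(active):
            if w not in words:          # word disappeared: close its instance
                done.append(active.pop(w))
    done.extend(active.values())        # close instances still open at the end
    return done
```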

2.1 Text localization

Our approach to text location assumes that the text is roughly horizontal and that the characters have a minimum level of contrast with the image background. The text may be of either polarity (light text on a dark background, or dark text on a light background). The process first detects vertically oriented edge transitions in the gray-scale image, using a Sobel operator. The output of the operator is thresholded to form two binary images, one for dark-to-light transitions (B1) and the other for light-to-dark transitions (B2). Fig. 3a shows a sample gray-scale image that contains both light and dark text, and Fig. 3b shows the corresponding light-to-dark edge transition image B2 overlaid on the gray-scale image. A connected components algorithm is run on each binary image. Connected components that are determined (by examining their height and area) not to be due to text are eliminated; Fig. 3c shows the eliminated connected components in red. The remaining connected components are linked to form lines of text by searching the areas to the left and right of each connected component for additional connected components that are compatible in size and relative position. Finally, a rectangle is fitted to each line of detected text. Fig. 3d shows the results of the text location process. This approach is quite fast and can accommodate the entire range of font sizes in a single processing pass through the image data. Text regions are eliminated if their height is less than 6 pixels or their height-to-width ratio is greater than 0.5. The parameters for text location were tuned to minimize the possibility of missing any text.

2.2 Binarization

Binarization is performed on each text line independently. We assume that the text pixels are relatively homogeneous and that the intensity of the background pixels may be highly variable. For each text line, the polarity of the text is determined, and then a fixed threshold is chosen for the binarization of the text line. To determine the polarity, three histograms are computed. The gray-scale image is smoothed with a Gaussian kernel in preparation for computing histograms H1 and H2. Histogram H1 is composed of gray-scale pixels in the smoothed image on the right side of the connected components in the dark-to-light edge transition image B1 and on the left side of those in the light-to-dark edge transition image B2; if light text is present in this text region, these are the pixels most likely to belong to light text or to lie near the edge of light text. Similarly, histogram H2 is composed of gray-scale pixels in the smoothed image on the right side of the connected components in image B2 and on the left side of those in image B1. H3 is the histogram of a line of pixels immediately above the text line in the original gray-scale image. The gray-scale value Gi at the peak of each histogram Hi is found, and the polarity of the text is determined as follows: if |G1 - G3| - |G2 - G3| > GMinDiff, the text is light; else if |G2 - G3| - |G1 - G3| > GMinDiff, the text is dark; otherwise, the text could be either light or dark. The threshold for the text line is then set to the gray value at the 80th percentile of histogram H1 or H2, depending on the polarity chosen. If the text could be either light or dark, binarizations of both polarities are sent to the OCR processing step.

2.3 OCR

The binarized text lines are deskewed and packed into a single buffer for processing by the OCR engine.
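The polarity test and percentile threshold can be sketched as follows. The histograms are assumed to be 256-bin gray-level counts gathered as described above; the symmetric form of the comparison is our reading of the (partially garbled) rule, and the default GMinDiff of 20 is a placeholder, not a value from the paper.

```python
import numpy as np

def polarity_and_threshold(h1, h2, h3, g_min_diff=20):
    """h1, h2, h3: 256-bin gray-level histograms for the light-text
    candidates, dark-text candidates, and the background line above
    the text, respectively. Returns (polarity, binarization threshold)."""
    g1, g2, g3 = (int(np.argmax(h)) for h in (h1, h2, h3))
    d1, d2 = abs(g1 - g3), abs(g2 - g3)   # distance of each peak from background
    if d1 - d2 > g_min_diff:
        polarity = "light"                # light-text peak stands out from background
    elif d2 - d1 > g_min_diff:
        polarity = "dark"                 # dark-text peak stands out from background
    else:
        return "either", None             # binarize both polarities downstream
    hist = np.asarray(h1 if polarity == "light" else h2, dtype=float)
    cdf = np.cumsum(hist) / max(hist.sum(), 1.0)
    threshold = int(np.searchsorted(cdf, 0.80))   # 80th-percentile gray value
    return polarity, threshold
```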
We used the Caere Corporation DevKit2000 commercial OCR package. The output of the OCR process is a series of hierarchical structures (one for each processed frame) of text lines, words, and characters, with multiple candidate identities for each recognized text character in the image, rank-ordered according to likelihood. A confidence value is associated with the top-ranked character. These structures were stored in Document Attribute Format Specification (DAFS) format [12], a standard for representing OCR and document image decomposition data.
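The line/word/character hierarchy with ranked candidates can be pictured with the small sketch below; the class names and fields are ours and only mirror the shape of the structures, not the actual DAFS schema or the DevKit2000 API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Char:
    candidates: List[Tuple[str, float]]   # identities ranked by likelihood
    @property
    def top(self):                        # top-ranked identity and its confidence
        return self.candidates[0]

@dataclass
class Word:
    chars: List[Char] = field(default_factory=list)
    @property
    def text(self):                       # read the word from top-ranked characters
        return "".join(c.top[0] for c in self.chars)

@dataclass
class TextLine:
    words: List[Word] = field(default_factory=list)
    bbox: Tuple[int, int, int, int] = (0, 0, 0, 0)   # x0, y0, x1, y1 in pixels
```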

3. MAP DETECTION

The first phase of our system identifies video frames containing full-screen maps, based on the text localization and recognition results obtained as described above and a rough image-level feature analysis. Each frame is processed independently.

3.1 Feature extraction

The following features were designed to characterize and distinguish maps from other on-screen objects.

Color homogeneity features
Most full-screen political maps in news broadcasts can be characterized by a small number of primary colors, typically corresponding to land, water, and nation boundaries. Shades of blue usually represent bodies of water; shades of gray, brown, or green usually represent land. The land and water segments usually constitute most of the map area and are largely spatially contiguous and homogeneous in color. Political boundaries between nations are often displayed in darker shades of the color chosen for the land sections of the map. The following features attempt to quantify these characteristics:

1. Segmentation error: the mean squared error between the original image and the segmented image.
2. Color reduction factor: the reduction in the number of image colors due to segmentation, computed as the ratio of the number of colors in the segmented image to the number of colors in the original image.

Contour features
Compared with most non-map content in news broadcasts, maps have sparser contour lines, usually resulting from the boundaries of nations or states. We compute contour sparseness as the fraction of the total number of non-text pixels that belong to contour lines.

Content-independent text features
Videotext is usually displayed in contrasting colors for better readability. Most text on full-screen political maps corresponds to names of geographic locations such as nations or cities. Caption text, such as the title of the story, may sometimes be present, typically in the bottom third of the frame. Text designating geographical locations tends to appear as single, isolated words spread over the image. Most nation names are composed of one or two words; two-word nation names are typically displayed as a single line of text or as two left-justified lines of one word each. We use the following two text features:

1. Average contrast for text: the average ratio of gray values for the foreground and background pixels in videotext.
2. Text distribution index: the fraction of the total number of videotext objects that are isolated, contain a maximum of two words, and satisfy the justification constraints described above.

Lexicon-based features
Most isolated words on a political map are names of geographical locations such as nations and cities. We define a lexicon-match index as the number of isolated one- or two-word videotext objects that match an entry in the specified lexicon of geographical locations, with the degree of match exceeding a given minimum acceptable level.

3.2 Feature matching

To detect the presence or absence of a full-screen map, the above-defined features are used in a manually configured decision tree. The decision rules attempt to encode the typical characteristics of full-screen maps described above.
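As an illustration of the color homogeneity features fed to the decision tree, the sketch below computes the segmentation error and color reduction factor for an image. The uniform color quantization used as a stand-in segmenter is our assumption, since the paper does not specify the segmentation algorithm.

```python
import numpy as np

def color_homogeneity_features(image, levels=4):
    """image: H x W x 3 uint8 array. Quantizes colors as a stand-in for
    the (unspecified) segmentation, then returns
    (segmentation_error, color_reduction_factor)."""
    img = image.astype(float)
    step = 256.0 / levels
    seg = (np.floor(img / step) + 0.5) * step       # snap each channel to a bin center
    seg_error = float(np.mean((img - seg) ** 2))    # mean squared error vs. original

    def n_colors(a):
        # count distinct RGB triples
        return len(np.unique(a.reshape(-1, a.shape[-1]).astype(np.uint8), axis=0))

    reduction = n_colors(seg) / n_colors(image)     # segmented colors / original colors
    return seg_error, reduction
```

A low segmentation error together with a small color reduction factor indicates a frame dominated by a few homogeneous color regions, as expected of a political map.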

Our ongoing work has focused on automating the generation of the decision tree via machine learning methods such as C4.5.

4. MAP RECOGNITION

As stated earlier, the method of map recognition presented in this paper is based on the premise that the relative placement of text superimposed on a map roughly corresponds to the geographic locations the text represents. Such placement of text makes a map more readable and understandable by the viewer, which is critical in light of the brief time the map is displayed. This approach can be expected to be computationally less expensive than a generic scale-independent shape recognizer for matching nation-boundary contours. Such an analysis, however, must address three coupled uncertainties:

1. Uncertainty in placement of text: Within the loose limits set by the geographical boundaries of the nation in question, the graphic artist may have significant freedom to place the text for better appearance and/or readability. This is especially true for large countries.
2. Uncertainty in perceived content of text: The extraction and recognition mechanism for videotext is not perfect: the text may not be detected at all, only part of the text may be detected, word segmentation may be inaccurate, and the character recognition results may contain errors.
3. Uncertainty of scale: The scale of the map, in terms of latitudes or longitudes per unit distance in the video frame, is not known a priori. In addition, the curvature of the earth's surface may cause the scale to differ in different parts of the same map.

4.1 Iterative optimization procedure

When multiple geo-text objects occur in the same frame, the resulting redundancies may mitigate the above uncertainties. In other words, one may be able to iteratively optimize the global consistency of the frame with respect to the placement and content of the text and the scale of the map. Given the results of text extraction and recognition for isolated videotext objects, our map recognition algorithm consists of the following steps:

1. Initialize the set of potential geo-texts to include all isolated text objects containing a maximum of two words.
2. Using an initially low lexicon-match threshold, reduce the current set of potential geo-texts to those that result in an acceptable match with one of the lexicon entries. If this set contains fewer than two objects, stop: the frame does not contain enough information to recognize a map.
3. Set the actual, or pixel-domain, X and Y coordinates of each geo-text object to those of the centroid of its bounding box.
4. For each geo-text object that matches the name of a city, set its expected, or geographical, X and Y coordinates to that city's longitude and latitude, respectively. For each geo-text object that matches the name of a nation, set its expected X and Y coordinates to the longitude and latitude of one of its major cities, and iterate step 5 over all major cities for all geo-texts that match nation names. In other words, use the major cities of a nation as sample points where the geo-text could be placed, under the assumption that the major cities adequately cover the geographical area of the entire nation.
5. Compute the acceptability of the current set of hypotheses by calculating the least-squares linear fit between the sets of actual and expected coordinates of all geo-texts.
6. With ideal recognition, placement, and a linear geographical scale, the actual and expected coordinates would be perfectly linearly related.
The slope of this line corresponds to the scale of the map, and the intercepts correspond to its offset. Choose the best set of sample cities for the nation geo-texts by maximizing the acceptability of the results from step 5.
7. If the maximum acceptability of the results from step 5 is less than a given threshold, increase the match threshold from step 2 sufficiently and iterate steps 1 through 6. This reduces the possibility that an incorrect lexicon match might prevent convergence.
8. The parameters of the linear fit provide the geographical scale and offset of the map in the image. Calculate the latitudes and longitudes of the four corners of the full-screen map. Map recognition is complete.

Note that step 2 requires a minimum of two geo-text objects in a given frame to match the lexicon entries, since it takes at least two data points to compute a linear fit.
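A minimal sketch of the per-axis least-squares fit behind steps 5 through 8 follows, assuming the geo-text pixel centroids and expected longitude/latitude pairs have already been gathered. The acceptability score (negative RMS residual) and the function names are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def fit_map_geometry(pixel_xy, geo_lonlat):
    """Per-axis least-squares fit between pixel coordinates of geo-texts and
    their expected longitude/latitude (step 5). Returns ((scale, offset) per
    axis, acceptability), or None when fewer than two geo-texts remain."""
    px = np.asarray(pixel_xy, dtype=float)     # (n, 2): x, y centroids in pixels
    geo = np.asarray(geo_lonlat, dtype=float)  # (n, 2): lon, lat in degrees
    if len(px) < 2:
        return None                            # step 2: a line needs two points
    params, sq_err = [], 0.0
    for axis in range(2):                      # fit x against lon, y against lat
        scale, offset = np.polyfit(geo[:, axis], px[:, axis], deg=1)
        sq_err += float(np.mean((scale * geo[:, axis] + offset - px[:, axis]) ** 2))
        params.append((scale, offset))
    return params, -np.sqrt(sq_err)            # higher acceptability = better fit

def frame_corners_lonlat(params, width, height):
    """Step 8: invert the fit to obtain longitude/latitude of the frame corners.
    (Image y grows downward while latitude grows northward, so the fitted
    y-scale is normally negative.)"""
    (sx, ox), (sy, oy) = params
    return [((x - ox) / sx, (y - oy) / sy)
            for x in (0, width) for y in (0, height)]
```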

4.2 Temporal agglomeration

For enhanced readability and emphasis, maps and the accompanying videotext are typically displayed over multiple frames at a time. Our experiments indicated that a full-screen map is typically displayed for a minimum of 5 s. Unlike still images, this temporal contiguity of video content presents a redundancy that can be exploited to improve the accuracy of map detection and localization. Our approach to temporal agglomeration in the context of full-screen map recognition consists of area-based histogramming over the duration of the recognized map. The thresholded plateau of this histogram represents a consensus area over many temporally recognized frames and mitigates the effect of the recognition errors and placement uncertainties discussed above.

5. RESULTS

We applied the above algorithms for map detection, localization, and temporal agglomeration to three illustrative MPEG-1 video clips from the CNN Interactive newsroom. Figs. 4 and 5 show sample map localization results. Fig. 4a shows an example frame with a map of the Korean peninsula. As can be seen in Fig. 4b, our analysis successfully localizes the geographical region depicted in this video frame, based solely on the relative positions of the videotext in the frame. Fig. 5 presents the intermediate steps of our analysis in more detail. The input image in Fig. 5a was subjected to videotext extraction and recognition, the raw results of which are presented in Fig. 5b. The text strings that resulted in an acceptable degree of match with a lexicon of country names are shown in Fig. 5c; note that the word CHINA is detected twice. The iterative optimization procedure from Section 4 is then applied to jointly improve the degree of lexicon matches and the overall consistency with expected geographic locations. Figs. 5d and 5e show the results of this procedure; the region depicted in the video frame was successfully localized.

Our test videos contained over 45,000 frames, of which roughly 2000 contained full-screen political maps. The video frames with maps were manually ground-truthed to record the geographical area covered in each map. We define the following area-based precision and recall metrics to quantify the map localization performance of the agglomerated result for each temporally contiguous map sequence. The area-based precision is the ratio of the area of overlap between the ground-truthed and predicted map localizations to the area of the predicted map localization; analogously, the area-based recall is the ratio of the area of overlap to the area of the ground-truthed map localization. For each agglomerated map sequence, we considered area-based precision and recall ratios greater than 75% to constitute an acceptable overlap for the purposes of content analysis. Given this threshold, our method successfully localized 75% of the political map sequences in the CNN video clips.
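The area-based metrics can be made concrete with the short sketch below, assuming both localizations are axis-aligned boxes given as (left, top, right, bottom); the box representation is our simplification of the map regions.

```python
def area_precision_recall(truth, pred):
    """truth, pred: axis-aligned boxes (left, top, right, bottom), in either
    pixels or degrees. Returns (precision, recall) based on area of overlap."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    inter = area((max(truth[0], pred[0]), max(truth[1], pred[1]),
                  min(truth[2], pred[2]), min(truth[3], pred[3])))
    precision = inter / area(pred) if area(pred) else 0.0
    recall = inter / area(truth) if area(truth) else 0.0
    return precision, recall

# A map sequence counts as successfully localized when both ratios exceed 0.75.
```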
Lack of success in some cases was due to frames whose maps contained only a single geo-text object, which prevented the estimation of map scale and offset via a linear fit. There were no false alarms.

6. CONCLUSION

This work has demonstrated a novel approach to video content analysis that does not require rigorous object recognition. Superimposed videotext is often used by broadcasters as a precise mechanism to convey specific information to the viewer. Our work makes use of this fact by analyzing not only the textual content of the recognized videotext, but also its syntactic attributes, such as its relative location, color, and size. This concept is illustrated here by a system that detects and recognizes near-full-screen political maps in news video, and it has been demonstrated to work well without computationally expensive, rigorous object recognition.

REFERENCES

1. R. Houghton, "Named Faces: Putting Names to Faces," Intelligent Systems, 14:5.
2. R. Lienhart, "Indexing and Retrieval of Digital Video Sequences based on Automatic Text Recognition," in Proc. Fourth ACM International Multimedia Conf.
3. T. Sato, K. Takeo, E. Hughes, and M. Smith, "Video OCR for Digital News Archive," in Proc. Intl. Workshop on Content-Based Access of Image and Video Databases (CAIVD '98).
4. H. Li and D. Doermann, "Automatic Identification of Text in Digital Video Key Frames," in Proc. Intl. Conf. on Pattern Recognition.
5. A. Jain and B. Yu, "Automatic Text Location in Images and Video Frames," in Proc. Intl. Conf. on Pattern Recognition.
6. J. Ohya, A. Shio, and S. Akamatsu, "Recognizing Characters in Scene Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 16:2.
7. Y. Zhong, K. Karu, and A. Jain, "Locating Text in Complex Color Images," in Proc. Third International Conference on Document Analysis and Recognition.
8. A. Jain and S. Bhattacharjee, "Text Segmentation Using Gabor Filters for Automatic Document Processing," Machine Vision and Applications, 5.
9. V. Wu, R. Manmatha, and E. Riseman, "Automatic Text Detection and Recognition," in Proc. Image Understanding Workshop.
10. G. Myers, J. Herson, J. DeCurtins, R. Bolles, and A. Stolcke, "Multimodal Fusion for Autonomous TV Monitoring (AVTM): Phase 3 Final Report," ITAD-1681-FR, SRI International, Menlo Park, California.
11. M.A. Smith and T. Kanade, "Video Skimming for Quick Browsing Based on Audio and Image Characterization," Technical Report CMU-CS, Carnegie Mellon University.
12. DAFS.ORG: Supporting the Document Attribute Format Specification (DAFS) standard.

Figure 1: Near-full-screen political maps in news videos.

Figure 2: Videotext recognition process (gray-scale image, text location, text line coordinates, binarization, binarized text lines, OCR, OCR results).

Figure 3: Results of text location. (a) Input image; (b) transition image overlay; (c) candidate connected components; (d) located videotext.

Figure 4: Map recognition results. (a) Input image; (b) geographical localization.

Figure 5: Map recognition results. (a) Input image; (b) raw recognized videotext; (c) place name lexicon matches; (d) geographically consistent matches; (e) localized map area.
