Broken Characters Identification for Thai Character Recognition Systems

Similar documents
The Clustering Technique for Thai Handwritten Recognition

Thai Text Localization in Natural Scene Images using Convolutional Neural Network

ค ม อการใช งาน ET-BASE GSM SIM900

Hand Motion Analysis for Thai Alphabet Recognition using HMM

Proposal for the Thai Script Root Zone LGR

MACHINE REPRESENTATION OF OTHER DATA FORMATS (TEXT CHARACTERS, STRINGS).

ZONE: LOOX LED No. Art No. Description Unit Unit Rate PANEL 1: COMPLETE 12V / 24V / 350MA SOLUTIONS

IDN Root Zone LGR Workshop. Sarmad Hussain ICANN 55 9 March 2016

Looking forward to a successful coopertation : TEIN

ISI Web of Science. SciFinder Scholar. PubMed ส บค นจากฐานข อม ล

MULTI-FEATURE EXTRACTION FOR PRINTED THAI CHARACTER RECOGNITION

ว ธ การต ดต ง Symantec Endpoint Protection

What s Hot & What s New from Microsoft ส มล อน นตธนะสาร Segment Marketing Manager

INPUT Input point Measuring cycle Input type Disconnection detection Input filter

โปรแกรมท น าสนใจส าหร บไตรมาสน

Chapter 3 Outline. Relational Model Concepts. The Relational Data Model and Relational Database Constraints Database System 1

Matching Entities from Bilingual (Thai/English) Data Sources

กองว ชาประว ต ศาสตร ส วนการศ กษา โรงเร ยนนายร อยพระจ ลจอมเกล า 18 ต ลาคม พ.ศ. 2549

The New Effective Tool for Data Migration from Old PACS (Rogan) to New PACS (Fuji Synapse) with Integrated Thai Patient Names

Verified by Visa Activation Service For Cardholder Manual. November 2016

C Programming

เคร องว ดระยะด วยแสงเลเซอร แบบม อถ อ ย ห อ Leica DISTO ร น D110 (Bluetooth Smart) ประเทศสว ตเซอร แลนด

Fundamentals of Database Systems

Framework based on Mobile Augmented Reality for Translating Food Menu in Thai Language to Malay Language

OCR For Handwritten Marathi Script

Proposal to Encode Lao Characters for Pali

Chapter 9: Virtual-Memory Management Dr. Varin Chouvatut. Operating System Concepts 8 th Edition,

6. Applications - Text recognition in videos - Semantic video analysis

AJAX เสถ ยร ห นตา ส าน กเทคโนโลย สารสนเทศและการส อสาร มหาว ทยาล ยนเรศวร พะเยา

MT7049 การออกแบบและฐานข อม ลบนเว บ

JavaScript Framework: AngularJS

C Programming

Chapter 8: Memory- Management Strategies Dr. Varin Chouvatut

Optical Character Recognition (OCR) for Printed Devnagari Script Using Artificial Neural Network

Thailand s Social Media Analysis regarding the Great East Japan Earthquake Investigate target social media candidates

Chapter 4. Introducing Oracle Database XE 11g R2. Oracle Database XE is a great starter database for:

Specifications 14TB 12TB 10TB 8TB 6TB 4TB 3TB 2TB 1TB

ว.ว ทย. มข. 45(2) (2560) KKU Sci. J. 45(2) (2017) บทค ดย อ ABSTRACT

Crystal Report & Crystal Server 2016

ประว ต ว ทยากร ห วข อ 3G Technologies & 4G Roadmap ในว นพฤห สบด ท 3 ธ นวาคม 2552 เวลา น.

A Novel Smoke Detection Method Using Support Vector Machine

HCR Using K-Means Clustering Algorithm

A New Approach to Detect and Extract Characters from Off-Line Printed Images and Text

PUBLIC TRAINING PLAN

I/O. Output. Input. Input ของจาวา จะเป น stream จะอ าน stream ใช คลาส Scanner. standard input. standard output. standard err. command line file.

A First Book of ANSI C Fourth Edition. Chapter 9 Character Strings

Indian Multi-Script Full Pin-code String Recognition for Postal Automation

Lecture 6 Register Transfer Methodology. Pinit Kumhom

Chapter 11 Object-Oriented Design Exception and binary I/O can be covered after Chapter 9

A Study of Thai Succession Law Ontology on Supreme Court Sentences Retrieval

I-Pats: An Intelligent Search System for Thai Patents

Thai Food Safety Document Searching System by Ontology

Computer Graphics. Chapter 5 Geometric Transformations. Somsak Walairacht, Computer Engineering, KMITL

ร จ กก บ MySQL Cluster

CPE 426 Computer Networks. Chapter 5: Text Chapter 23: Support Protocols

Application Protocols

Separation of Overlapping Text from Graphics

Handwritten Devanagari Character Recognition Model Using Neural Network

Structure of Department

Recognition of Off-Line Handwritten Devnagari Characters Using Quadratic Classifier

แผนการสอนว ชา การเข ยนโปรแกรมคอมพ วเตอร 2 (Computer Programming 2) ภาคการศ กษา 1 ป การศ กษา 2559

An Efficient Character Segmentation Based on VNP Algorithm

An Efficient Character Segmentation Algorithm for Printed Chinese Documents

A Survey of Problems of Overlapped Handwritten Characters in Recognition process for Gurmukhi Script

Today Topics. Artificial Intelligent??? Artificial Intelligent??? Intelligent Behaviors. Intelligent Behavior (Con t) 20/07/52

บทท 4 ข นตอนการทดลอง

IS311. Data Structures and Java Collections

Medical Information. Objectives. Literature Search : PubMed. Know. Evaluation. Medical informatics Literature search : PubMed PICO Approach

Lecture Outline. 1. Semantic Web Technologies 2. A Layered Approach 3. Data Integration

ภาคผนวก ก Coding VPN IPSec Site-to-Site

Revised Proposal to Encode Lao Characters for Pali

Course: Project Management Learning world class business project management skills

Tonal Language Speech Compression Based on a Bitrate Scalable Multi-Pulse Based Code Excited Linear Prediction Coder

Optical Character Recognition

Evaluation of Web Search Engines with Thai Queries

CATALOGUE N1506TH

PAST RESEARCH TOPICS

AN EFFICIENT BINARIZATION TECHNIQUE FOR FINGERPRINT IMAGES S. B. SRIDEVI M.Tech., Department of ECE

An SMS-Based Fault Dispatching System: An Additional Utilisation of a Mobile Phone Infrastructure

INFINITE. HF RFID Specification

Parallel K-means Clustering Algorithm on NOWs

วารสารส งคมศาสตร ป ท 3 ฉบ บท 2 กรกฎาคม ธ นวาคม 2557 บทค ดย อ

DATA EMBEDDING IN TEXT FOR A COPIER SYSTEM

Segmentation of Characters of Devanagari Script Documents

SEARCH STRATEGIES KANOKWATT SHIANGJEN COMPUTER SCIENCE SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY UNIVERSITY OF PHAYAO

Robust PDF Table Locator

Example: How to create a shape from SpecialShapeFactory.

A Document Image Analysis System on Parallel Processors

Layout Segmentation of Scanned Newspaper Documents

Ethernet'Basics. Topics

Offline Signature verification and recognition using ART 1

(ATMS01) OPENING CEREMONY Automotive Summit 2018 and Manufacturing Expo 2018 Developmental Approach for SMEs Access to Automation,

Source Code ช อ frmcar

Interactive Segmentation and Three-Dimension Reconstruction for Cone-Beam Computed-Tomography Images

Internet Applications

Glossary. Mathematics Glossary. Elementary School Level. English Thai

ร ปแบบใหม ของการต ดต อส อสารไร สาย

ElasticHosts Correspondents

Computer Graphics. Lecture 3 Graphics Output Primitives. Somsak Walairacht, Computer Engineering, KMITL

ภาคผนวก ก การต ดต งโปรแกรม

Transcription:

Broken Characters Identification for Thai Character Recognition Systems NUCHAREE PREMCHAISWADI*, WICHIAN PREMCHAISWADI* UBOLRAT PACHIYANUKUL**, SEINOSUKE NARITA*** *Faculty of Information Technology, Dhurakijpundit University, Bangkok, 2, THAILAND **Faculty of Information Technology, Mahanakorn University, Bangkok, THAILAND ***Department of Electrical, Electronics and Computer Engineering, School of Science and Engineering, Waseda University, Tokyo, JAPAN Tel. (8)3-5286-3374 Fax. (8)3-322-994 Abstract: - This paper presents a scheme for solving the problem of broken characters in Thai character recognition systems. The proposed scheme consists of two main steps namely: broken character identification and connection of broken characters. There are two techniques used to identify broken characters as follows: ) an overlapping area and 2) character code. The specific characteristics of Thai characters are also employed in the scheme such as position of the head, leg and endpoint of characters. The experimental results show that the scheme can identify and repair broken characters both in normal and bold font efficiently. The recognition rate of commercially available software is improved significantly. Key-Words: - Broken characters, Thai character, Character recognition, Vertical broken, Character code Introduction Documents or forms are typically prepared only as originals. If additional copies are needed, they are made from the original. Each time copies are made, the quality of the image deteriorates, which can result in broken characters. The output quality from a typical facsimile is usually low in terms of clarity, because of low image resolution. In Thailand, broken characters are usually seen in fax-transmitted documents, especially from the transmission of copied (non-original) documents. When rather thin sheets of paper are fed into a printer, small folds are sometimes be found, which can result in broken characters in the folded areas. Sheets which have already been used on one side are often re-used on the other side for economic reasons, and folding can occur easily due to changed characteristics caused by heat from the first use. This phenomenon can often be found in sheets re-used in laser printer or copy machines. The completeness of the input character images affects the accuracy of a character recognition system significantly. Therefore, much research on character recognition systems has focused on preprocessing processes such as segmentation of touching characters [], document image binarization [2]. Although much research has been done on Thai character recognition [3] [4], this problem still exists and causes erroneous results in recognition systems. Table shows an example of a recognition result obtained by using commercially available Thai language OCR software for recognition of broken character images. Table. Example of an OCR recognition result for broken characters. Test data Correct result ถ กอ กขระ ThaiOCR Software ถ กeา f ระ When a character image is broken into several segments, these segments cannot be treated as a single complete character. A segment is defined as a rectangular area that covers connected pixels. A simple merging of these small segments is not feasible, since it is not clear beforehand which segments belong to which character. Thus, a more advanced technique is required to solve this type of

problem: grouping segments of characters to form an entity representing a character. A method for reconstructing damaged characters was proposed in [6], but it can only be used for characters damaged by lines of a table. It does not address the problem of broken characters and cannot be applied in the case of Thai characters. This paper is an extension of the method proposed by the authors in [5]. The proposed scheme can be applied to characters broken along the vertical line which can not be supported in [5]. 2 Thai Character Set and their Characteristics Thai character set consists of 44 consonants, 5 vowels and 8 voice tones as shown in Table 2 [2]. The character อ in vowels and voice tones is used only to represent the location of vowels and voice tones characters. In writing Thai words, it will be replaced by one of a consonant character. There is no spacing between words and no special mark to identify the end of a sentence. The Thai vowel forms do not follow initial consonants; some are placed before the initial consonants, some after the consonants, some above the consonants, and some underneath the consonants. The vowels that are complex forms (i.e. composed of more than one part) can be placed around the consonants. Type Consonants Vowels Voice tones Table 2. Thai character set Member ก ข ค ง จ ฉ ช ซ ฌ ญ ฎ ฏ ฐ ฑ ฒ ณ ด ต ถ ท ธ น บ ป ผ ฝ พ ฟ ภ ม ย ร ล ว ศ ษ ส ห ฬ อ ฮ ะ า อ อ อ อ อ อ เ แ โ ไ ใ อ อ อ อ อ อ อ อ ฯ ๆ A Thai sentence is consists of up to the maximum of three zones. A character may occupy in one of the three zones namely: the upper zone (UZ), central zone (CZ) and lower zone (LZ). The upper zone is also classified into two sub-zones namely: the upper zone (UZ) and upper zone 2 (UZ2) as shown in Fig.. The multi-level structure of a Thai sentence made it looks very complicate and difficult to recognize. However, on the opposite side, if the zone information is obtained, it will be very useful to classify characters into groups with a smaller number of members. The zone information is obtained by using histogram and each character block is obtained by using edge detection algorithms. 3 Pre-processing process The objective of the pre-processing process is to determine the segment boundaries (SBs) and classify each SB into one of the three zones mentioned in the previous section. An SB is defined as the smallest rectangular area that covers all of a set of connected pixels. Each SB may consist of a single complete character, several connected characters, or a part of a broken character. The pre-processing process consists of two steps: segment boundary determination and zone classification []. An example of input and output of this process is the sequence of SBs, as shown in Fig. 2. UZ CZ LZ เช าซ อกล ม Fig.. A sample of a Thai sentence. UZ UZ2 a) input b) the sequence of SBs. Fig. 2. An input and output of the pre-processing process. 4. Identification of Broken Characters In this paper, only vertically broken characters are considered. The horizontally broken characters already reported by the authors in [5]. In this process, there are two techniques that used to identify broken characters namely: an overlapping area and character code as follows. 4. An Overlapping Area The coordinate in x-axis of each SB is used to determine whether there is any overlapping area. The overall process of the step is shown in Fig. 3.

The process to find overlapping area Normal character Broken character Document image Character segmentation SBs ( x Max<=x 2 Max) Y and( x Max>=x 2 N Max, y Max) Max, y 2 Max) (x Max, y Max) Max, y 2 Max) Max, y 2 Max) (x Max, y Max) SB (x 2 Max<=x Max) and ( x 2 Max>=x N Y Connection process SB Broken area SB Find character code Fig. 3. The overall process of finding an overlapping area. Table 3. Character code in 6 patterns. No. Structure Character code Character member No. Structure Character code Character member 2 3 ก ค ฅ ฆ ฉ ฑ ฒ ด ต ท พ ฟ ภ ศ ส ห ฬ ข ฃ ฆ ช ซ ฌ ญ ณ น บ ป ผ ฝ ม ษ ง จ ฉ ช ซ ฎ ฏ ฐ ถ ธ ผ ฝ ย ล ศ ษ ส อ ฮ 9 Broken character Broken character Broken character 4 Broken character 2 Broken character 5 เ ไ ใ 3 Broken character 6 ฉ ฐ ร ว อ 4 Broken character 7 โ 5 No character 8 Broken character 6 No character

4.2 Specific feature and Character Code The process is used to identify broken characters that can not be found by using an overlapping area described previously. In this process, the specific feature of Thai character such as head of a character, the circle used as the starting point of writing a character as shown in Fig.4. 2 3 4 a) Area of head position b) A character has no head the shortest distance is selected for connecting these broken characters. 5.2 Connection by using Character Code All characters that passed to step 5. are also passed to the process to check by using character code again. The character code and head position of a character are used to determine which part, the next or previous block, should be used to form a complete character as examples shown in Fig.6. ฎ บ previous block next block c) A character has one d) A character has 2 head position at heads position at and 4 Fig. 4. Head position of a character present block ถ ย The character code is constructed by finding the boundary from center of the character in four directions namely: top, bottom, left and right. The example of finding a character code is shown in Fig. 5. Character code is Fig. 5. Examples of character code. next block present block Fig.6. Connection by character code 6. Results and Conclusion The method was implemented by using Visual Basic Version 6.. The test documents are obtained from image scanner at resolution 3 dpi. Fig. 7 shows examples of broken character images and the reconstruction result of these images are shown in Fig. 8. The character code of Thai character set can be classified into 6 patterns as shown in Table 3. The character code is not only used to identify Thai broken characters but also determined the area that characters could be connected together. 5. Connection of Broken Characters The process is used to connect the broken pieces of character together. The process consists of two steps namely; connection by considering the overlapping area and connection by using character code. 5. Connection by considering the overlapping area In this process, the position of broken character already known from previous processes. Therefore, Figure 7. Examples of broken character images.

recognition errors also depends on the efficiency of the recognition software itself. From the experimental results, it can be concluded that the proposed scheme can be used to reconstruct broken characters efficiently and will be useful in significantly improving the recognition rate of Thai character recognition systems. Figure 8. Result of reconstruction. The effectiveness of the proposed scheme can be found by using the commercially available Thai character recognition software. The recognition results of character images shown in Fig. 7 and Fig.8 by using the commercially available Thai character recognition software are presented in Table 4. Table 4. Recognition results of character images in Fig. 7 and Fig. 8. Recognition with ThaiOCR Version.5 Broken characters Repaired characters เ น ยงเ IWาตา 8jย8ยาใาร เพ ยงเมตตาช วยอาทร ใะ จร งไร อยากจาก ใจจร งไม อยากจาก [lกe าf)t'ร ถ กอ กๆระ 4 ท 3J ตะr ะJสะสมPe)L ส ะร ผ ท ม ตะก วสะสมอย ส ง f]ารฉ 5 ฯe eค ร ง เสร 4 การฉ ดฮอร โมนเสร ม The results shown in Table 4 also are in the form as they are in the output of the software. It can be seen that the software can not recognize almost of these broken characters if they are not repaired. 6. Conclusion A scheme for detection and reconstruction of broken printed Thai characters has been proposed. Broken characters are identified by using two techniques namely: an overlapping area and character code. The use of a projection profile and the overlapping areas of broken segments make the reconstructed character images very similar to the original characters. It can be seen that the correctness of the recognition results of repaired characters is much higher than that of the results of broken characters. However, there are still some erroneous recognition results, although the reconstructed characters are very similar to the original characters. The rate of References: [] N. Premchaiswadi, W. Premchaiswadi, and S. Narita, Segmentation of Horizontal and Vertical Touching Thai Characters, IEICE Trans. Fundamentals, Vol. E83-A, No.6, pp. 987-995, 2. [2] M. Kamel and A. Zhao, Extraction of Binary Character/Graphics Images from Grayscale Document Images, Computer Vision, Graphics and Image Processing 55, pp. 23-27, May 993. [3] P. Hiranvanichakorn and M. Boonsuwan, Recognition of Thai Characters, Proceedings of SNLP 93, pp. 23-66, 993. [4] C. Kimpan and S. Walairacht, Thai Character Recognition, Proceedings of SNLP 93, pp. 96-26, 993. [5] W. Premchaiswadi, N. Premchaiswadi, P. Limmaneewichid and S.Narita, Detection and Reconstruction Scheme for Broken Characters in Thai Character Recognition Systems, IPSJ Journal, Vol. 43, No. 6, pp.866-879, 22. [6] K. Lee, H. Byun, and Y. Lee, Robust Reconstruction of Damaged Character Images on the Form Documents, Lecture Notes in Computer Science 389, Springer Verlag, pp. 49-62, 998. [7] E. R. Davies, Machine Vision, Academic Press, 997. [8] S. V. Rice, G. Nagy, and T.A. Nartker, Optical Character Recognition: An Illustrated Guide to the Frontier, Kluwer Academic Publishers, 999.