An Empirical Comparative Study of Online Handwriting Chinese Character Recognition:Simplified v.s.traditional

2013 12th Internatonal Conference on Document Analyss and Recognton An Emprcal Comparatve Study of Onlne Handwrtng Chnese Recognton:Smplfed v.s.tradtonal Yan Gao, Lanwen Jn +, Wexn Yang School of Electronc and Informaton Engneerng South Chna Unversty of Technology Guangzhou, Chna thegoldfshwang@163.com, +lanwen.jn@gmal.com, wxy1290@163.com Abstract There are two forms of Chnese character wdely used n the world, smplfed Chnese and tradtonal Chnese. Smplfed Chnese s manly used n Chnese manland, Sngapore, Malaysa and other Southeast Asan regons, whle tradtonal Chnese s manly used n Hong Kong, Macao, Tawan, Japan and so on. The smplfed Chnese s transformed from tradtonal Chnese by smplfyng the character structure and reducng the stoke number of many tradtonal characters, whle a few reman. The smplfed Chnese reduces the memory storage and makes handwrtng become more convenent, whle t also brngs the problem that many characters are so smlar that they are dffcult to be recognzed. In ths paper, an emprcal study of the handwrtng smplfed/tradtonal Chnese character recognton are carred out n order to compare the dfference between these two forms of Chnese characters. The expermental results based on SCUT-Couch2009 database show that the handwrtng recognton accuracy of tradtonal Chnese s hgher than smplfed Chnese, for both and changed. Ths nterestng fndng may brng us some cues on the ssue of confusable Chnese character recognton for further study. Keywords-handwrtng recognton; smplfed Chnese; tradtonal Chnese I. INTRODUCTION Chnese s one of the oldest wrtng languages n the world and may be the only ancent language stll n use today [1]. As an deographc language, Chnese nherts the long hstory and culture of Chnese naton durng thousands of years. It s the language most wdely spread and used n the world, especally has the far-reachng mplcatons on Chna and the whole East Asan. The total number of Chnese character s about one hundred thousand. However, most of Chnese characters are varant and unfrequently used characters, only thousands of them are n daly use. Accordng to statstcs, 3,755 frequently used characters can cover more than 99% of the wrtten materal n Chna. Nowadays, there are two forms of Chnese character wdely used n the world, smplfed Chnese and tradtonal Chnese. Smplfed Chnese was used n the world snce 1950s, whch was proposed as the modern Chnese wrtng standard after the foundng of New Chna. However, the tradtonal Chnese s stll used n Hong Kong, Macao, Tawan now. Smplfed Chnese s generated from tradtonal Chnese by smplfyng the character structure and reducng the stoke number of most tradtonal characters. Smplfed Chnese has many advantages: t reduces the character stoke numbers; smplfes the character structure; and reduces the number of frequently used characters. All these advantages make Chnese easy to learn, read and wrte. Smplfed Chnese reduces the memory storage dffculty and makes handwrtng become more convenent. However, as an deographc language, smplfed Chnese reduces the nner meanng of Chnese characters, furthermore, t results n many smlar characters dffcult to recognze, such as and, and, and, the tradtonal forms are and, and, and, whch s sgnfcantly easer to dstngush. In recent years, as an mportant research drecton n pattern recognton feld, handwrtng Chnese character recognton (HCCR) technology has made great progress and the recognton accuracy acheved more than 98% on certan constraned databases [2-3] and more than 95% on realstc unconstraned databases [4-6]. However, t was noted n [6] that, the recognton accuracy of smplfed Chnese data s about 2% less than tradtonal Chnese data, although the two datas are collected n the same experment envronment and contrbuted by same wrters. To nvestgate ths ssue, we make a comprehensve comparatve study on the ssue of smplfed HCCR aganst the tradtonal HCCR. The rest of ths paper s organzed as follows: Secton II ntroduces the smplfed Chnese and tradtonal Chnese n detals. Secton III presents the handwrtng character recognzer we used n our experments. Secton IV gves the expermental results and compares the dfference between the smplfed Chnese and tradtonal Chnese. Conclusons are summarzed n Secton V. II. SIMPLIFIED CHINESE AND TRADITIONAL CHINESE Due to hstorcal and poltcal reasons, smplfed Chnese character s currently manly used n manland Chna whle tradtonal Chnese character s used n areas lke Tawan, Hong Kong and Macao. GB2312-80 s one standard of smplfed Chnese characters, whle BIG5 s the standard of 1520-5363/13 $26.00 2013 IEEE DOI 10.1109/ICDAR.2013.176 862

tradtonal characters. Although most of the smplfed Chnese characters are smplfed from ther tradtonal characters, qute a few s of the characters are. Examples of some smplfed and tradtonal characters are shown n Fgure 1. Examples of some characters n both s are shown n Fgure 2. (a) tradtonal Chnese (b) smplfed Chnese Fgure 1. The characters n GB and BIG5 Fgure 2. The characters n GB and BIG5 Smplfed Chnese s smplfed from tradtonal Chnese, n order to make Chnese easy to learn, read and wrte. There are four methods manly used n the process of generatng the smplfed Chnese characters [7]. The frst method s smplfyng the structure of tradtonal characters, ncludng several aspects such as followng. (1) A character s replaced by another exstng character wth same sound. (2) Some characters preserve the basc outlne. (3) Consderng the general cursvely wrtng habts, some of the characters are replaced by ther cursve forms. (4)Some characters are smplfed by replacng or omttng a complex component of the character. The new component can be ether a smple symbol or a near-sound component. (5) Some new characters are created to replace the tradtonal one. (6) Some ancent forms and varatons are adopted. Table I gves some examples n order to show how to smplfy the structure of tradtonal characters. TABLE I. EXAMPLES OF STRUCTURAL SIMPLIFICATION METHOD Detal Forms Examples Same Sounds Smlar basc outlne Use cursve form Replace component Omt component Create a new character Adopt ancent form Second, some characters are derved based on smplfed character components. Because of the smlarty of characters, smplfed characters can be created by systematcally smplfyng components. For nstance, s smplfed to, so from, and,, and can be made. s smplfed to, then, and convert to,,. And s smplfed to, thus, and convert to, and. Thrd, elmnate the varants of the same characters. A of varant characters often sound the same or share the same meanng, so the smplest one n form s chosen to represent ths of characters and the rest are abandoned. For example,, and are replaced by. And, and are replaced by. And and are replaced by. Fourth, adopt new standardzed character forms. Wth ths method, characters appear slghtly smpler than the old forms, and are as such mstaken as structurally smplfed characters mentoned n the frst method. Some example are shown as followng: convert to, to, to, to, and to. Wth these four smplfed methods, the average character stroke number per character s reduced from 16 to 10. Less tme and energy wll be spent n learnng and wrtng these smplfed characters. However, on the other hand, the energy-savng effect s lmted n the Internet Age and the smplfed characters ncrease ambguty and lose the orgnal meanng of tradtonal characters. Smplfed characters also cause many smlar characters dffcult to be recognzed. III. HANDWRITING CHARACTER RECOGNITION TECHNOLOGIES The general steps of handwrtng recognton nclude preprocessng, feature extracton, dmenson reducton, classfcaton and so on [8]. In ths paper, we use the compact MQDF handwrtng character recognzer [9] n the experments. The recognzer uses elastc mesh normalzaton and 8-drectonal feature extracton. The feature dmensonalty s reduced from 512 to 160 by Fsher lnear dscrmnant analyss (LDA). Fnally, modfed quadratc dscrmnant functon (MQDF) s used for classfcaton. A. 8-Drectonal Feature Extracton As an effectve feature extracton method, the 8- drectonal Feature [10] s wdely used n the feld of onlne handwrtng Chnese character recognton. The 8-drectonal feature s computed through the 8-drectonal vectors of each samplng pont. Then 8 drectonal pattern mages are generated accordngly, and the blurred drectonal features are extracted at 8 8 unformly sampled locatons usng a Gaussan flter. Fnally, a 512-dmensonal vector of raw features s formed. 863

B. Lnear Dscrmnant Analyss LDA [11] s a supervsed learnng method that can select the lower dmensonal sub-space features wth the most dscrmnatng nformaton. Mathematcally speakng, LDA can seek drectons for effcent dscrmnaton through maxmzng the between-class scatter whle mnmzng the wthn-class scatter. C. MQDF classfer The MQDF classfer proposed by Kmura et al. [12] s wdely used n handwrtng recognton for ts hgh classfcaton performance. The quadratc dscrmnant functon (QDF) s based on Bayesan decson rule, under the assumpton of multvarate Gaussan densty for each class. The MQDF s obtaned by makng a modfcaton to the QDF wth PCA transformaton and smoothng the mnor egenvalues by a constant, whch s shown n (1). _ 2 K _ 1 δ T 2 g1 ( x, ω ) = x x (1 )[ φ ( x x )] δ j= 1 λ K + j= 1 logλ + ( D K) logδ where x _ denotes the mean vector of class ω ; λ and φ denote the egenvalues and egenvectors respectvely of the covarance matrx of classω ; D s the dmenson of ; K s the number of domnant egenvectors and δ s a constant. IV. EXPERIMENTS AND ANALYSIS The data we used s SCUT-COUCH2009 [6,13], whch s a comprehensve database that conssts of 11 subs, ncludng smplfed and tradtonal Chnese characters, words, pnyns, letters, dgts, symbols and so on. SCUT-COUCH2009 s collected wth PDA (Personal Dgt Assstant) and smart phones wth touch screens, contrbuted by more than 190 dfferent persons, resultng n more than 3.6 mllon handwrtten samples. It s worth to note that SCUT-COUCH2009 s currently the only publc avalable database that ncludng the 5,401 BIG5 tradtonal Chnese character. In ths paper, we use the GB sub and BIG5 sub of SCUT-COUCH2009 as the smplfed Chnese and tradtonal Chnese datas respectvely. The two datas are collected n the same experment envronment. GB sub contans 188 s of handwrtten samples of 6,763 Chnese characters n GB2312-80 standard, whle BIG5 sub contans 65 s of 5,401 frequently characters n BIG5 standard. In order to ensure the experment envronment consstent, we randomly select 60 s for tranng and 5 s for testng n the both subs respectvely. The characters of GB standard are smplfed from BIG5 standard. The two character standards can be dvded nto three s respectvely: (1), where characters n ths are same n both GB and BIG5; (2), where characters n ths of GB and BIG5 have the same x _ (1) meanng but dfferent wrtng form; (3) dsjont, where n ths, characters n BIG5 does not appear n GB and vce versa. The detals of character number of each are gven n Table II. TABLE II. CHARACTER NUMBER OF EACH PART dsjont total GB 3,247 1,687 1,829 6,763 BIG5 3,247 1,687 467 5,401 In the followng experments, we compare the recognton accuracy of the GB and BIG5 datas, and analyss the dfference between the two datas. Due to that the dsjont s of GB and BIG5 are unrelated wth each other, the experments are manly focused on the and to make a far comparson. Table III shows the recognton accuraces of GB and BIG5 datas, where we only combne the and as the tranng and testng datas. From Table III t can be seen that the total accuracy of BIG5 s 2.57% hgher than that of GB data. For both and s, the accuraces of BIG5 are sgnfcant hgher than that of GB. TABLE III. RECOGNITION RATE OF UNCHANGED AND SYNONYM PART Average rate GB 95.12% 95.73% 95.34% BIG5 97.86% 98.01% 97.91% Besdes, we also carry the experments by usng the total 6,763 categores of GB and the 5,401 categores of BIG5. The smlar results are observed, as shown n Table IV. TABLE IV. RECOGNITION RATE OF THE TOTAL DATASETS dsjont Average rate GB 94.82% 95.19% 96.25% 95.05% BIG5 97.59% 97.83% 96.66% 97.61% One man dfference between smplfed and tradtonal Chnese s that the stroke number s dfferent. Fgure 3 shows the statstcal dstrbutons of the stroke number of GB and BIG5 datas, for standard GB and BIG5 Chnese s and for handwrtng subs form SCUT-COUCH 2009, respectvely. Fgure 3(a) shows the standard stroke number dstrbuton. It can be seen that the average stroke number of BIG5 s more than GB data. We can see that the average stroke number of the handwrtten samples n Fgure 3(b) s less than that of standard characters n Fgure 3(a), due to that many Chnese characters are wrtten cursvely, wth two or more strokes connected. However, the average stroke number of BIG5 s stll more than GB. 864

(a)the standard stroke number dstrbuton (b)the wrtng stroke number dstrbuton Fgure 3. The statstcs of the stroke number of GB and BIG5 Fgure 4 shows the hstogram relatonshp between recognton accuracy and the average stroke number. It can be seen that the average stroke number for both of BIG5 and GB datas are ncreasng wth the ncreasng of recognton accuracy, and the average stroke number of BIG5 s larger than GB when recognton accuracy s hgh. It s worth to note that the recognton accuraces of some few-stroke characters n GB are qute lower. Fgure 4. The relatonshp between recognton accuracy and the average stroke number It s known that the confusable smlar character recognton s the bottleneck to further mprove the performance of handwrtten Chnese character recognton. In order to analyze the recognton dfference of BIG5 and GB, we select some typcal characters and compare the recognton accuraces of ther correspondng smlar characters. The detals are shown n Table V and Table VI. Table V shows the recognton rate of 10 characters n the. It s obvously that the character accuraces of BIG5 are all hgher than GB, although the wrtng forms are same. Furthermore, for the correspondng smlar characters of the 10 gven characters n Table V, we can see that the smlar characters are much more dffcult to be correctly recognzed n GB data, such as and, and, and so on. However, these smlar character-pars n BIG5 are wrtten n tradtonal forms as and, and and so on, whch are much easer to be recognzed. Table VI shows the smlar results of the recognton rate of 10 characters n of both s. The tradtonal characters wth more stokes not only provde more feature nformaton, but also reduce the smlar characters and results n hgher recognton accuracy. It s worth to note that the two components and n GB data are usually wrtten n smlar way and hard to dstngush and the recognton accuraces of characters wth these components are usually lower than that of the correspondng tradtonal characters. V. CONCLUSIONS In ths paper, we have made an emprcal comparatve study of the handwrtng smplfed and tradtonal Chnese character recognton. Some nterestng fndng s observed, whch may be useful to brng us some cues on the ssue of confusable Chnese character recognton for further study. The experment results show that the average handwrtng stroke number of tradtonal Chnese s hgher than that of smplfed Chnese, and the character wth more stroke number usually result n hgher recognton accuracy. Furthermore, we compare the recognton accuracy of some characters n smplfed Chnese and tradtonal Chnese. It s obvously that the smplfed characters have more smlar characters that are dffcult to be recognzed, whch leads to performance decrease. Smplfed Chnese makes Chnese easy to learn, read and wrte. However, as an deographc language, smplfed Chnese reduces the nner meanng of Chnese characters, and brngs many smlar characters dffcult to recognze. Chnese character s stll n developng and t wll become more convenent for people to understand and use. It s suggested that we should also pay attenton on the tradtonal Chnese, because tradtonal Chnese nherts the long hstory and culture of Chnese naton. 865

TABLE V. RECOGNITION ACCURACY OF UNCHANGED PART The gven characters The Smlar characters characters Rate n GB Rate n BIG5 smplfed Rate n GB tradtonal Rate n BIG5 95.38% 100.00% 95.38% 100.00% 96.92 % 100.00% 96.92% 100.00% 96.92 % 98.46% 96.92% 100.00% 96.92 % 100.00% 95.38% 98.46% 96.92% 100.00% 95.38% 100.00% 98.46% 100.00% 98.46% 100.00% 98.46% 100.00% 98.46% 100.00% 96.92% 100.00% 96.92% 98.46% 95.38% 100.00% 96.92% 100.00% 98.46% 100.00% 93.82% 96.92% TABLE VI. RECOGNITION ACCURACY OF SYNONYM PART The gven characters The Smlar characters GB Rate n GB BIG5 Rate n BIG5 smplfed Rate n GB tradtonal Rate n BIG5 96.92% 98.46% 93.85% 100.00% 96.92% 100.00% 98.46% 100.00% 95.38% 100.00% 98.46% 96.92% 95.38% 100.00% 96.92% 98.46% 95.38% 100.00% 96.92% 96.92% 96.92% 98.46% 93.85% 100.00% 96.92% 100.00% 98.46% 100.00% ACKNOWLEDGMENT Ths research was ally supported by NSFC (no. 61075021), the Natonal Scence and Technology Support Plan (2013BAH65F01-2013BAH65F04), GDNSF (No. S2011020000541) and GDSTP (No.2012A010701001).. REFERENCES [1]. DeFrancs, John, The Chnese Language: Fact and Fantasy. Unversty of Hawa Press. ISBN 0-8248-1068-6. [2]. H. Lu and X. Dng, Handwrtten character recognton usng gradent feature and quadratc classfer wth dscrmnatve feature extracton, ICDAR2005, 2005, pp. 19-23. [3]. C. Lu, Hgh accuracy handwrtten Chnese character recognton usng quadratc slassfers wth dscrmnatve feature extracton, ICPR2006, 2006, pp. 942-945. [4]. Cheng-Ln Lu, Fe Yn, Da-Han Wang, Qu-Feng Wang, CASIA Onlne and Ofne Chnese Handwrtng Databases, Proc. 11th ICDAR, Beng, Chna, 2011 [5]. C. Lu, Fe Yn, Da-Han Wang, Qu-Feng Wang, Onlne and Offlne Handwrtten Chnese Recognton: Benchmarkng on New Databases, Pattern Recognton, 46(1): 155-162, 2013. [6]. L. Jn, Y. Gao, et al, SCUT-COUCH2009 A Comprehensve Onlne Unconstraned Chnese Handwrtng Database and Benchmark Evaluaton, Internatonal Journal on Document Analyss and Recognton, vol. 14, no.1, pp53-64,2011. [7]. Wkpeda: http://zh.wkpeda.org/wk/ [8]. C. Lu, S. Jaeger, M. Nakagawa, Onlne Recognton of Chnese s: The State-of-the-Art, IEEE Trans. Pattern Analyss and Machne Intellgence, vol. 26, no. 2, pp. 198-213, 2004. [9]. T. Long, L. Jn, Buldng compact MQDF classfer for large character recognton by subspace dstrbuton sharng, Pattern Recognton, Vol. 41, No. 9, pp. 2916-2925, 2008. [10]. Z. Ba and Q. Huo, A Study On the Use of 8-Drectonal Features For Onlne Handwrtten Chnese Recognton, ICDAR2005, 2005, pp. 232-236 [11]. R.A. Fsher, The Use of Multple Measurements n TaxonomcProblems. Annals of Eugencs 1936, Vol. 7, pp. 179-188. [12]. F. Kmura, K. Takashna, et al., Modfed quadratc dscrmnant functons and the applcaton to Chnese character recognton, IEEE Trans. Pattern Analyss and Machne Intellgence, vol. 9, no. 1, pp. 149 153, 1987. [13]. SCUT-COUCH, Webste: http://www.hclab.net/data/ 866