Journal of Information, Control and Management Systems, Vol. 5, (2007), No. 1

TRANSCRIPTION OF NUMERICAL OBJECTS TO TEXT FOR SLOVAK LANGUAGE

Ján GENČI

Technical University of Košice, Faculty of Electrical Engineering and Informatics, Slovak Republic
e-mail: genci@tuke.sk

Abstract

Text-to-speech systems are becoming widely used. However, these systems have to be tuned for every single language because of language-specific features. Usually these features concern the generation of speech from text. However, a text may also contain objects that need specific handling before speech synthesis. Numerical objects are one such category. This paper presents the handling of numerical objects and their transformation to text suitable for a text-to-speech system developed for the Slovak language. We present the process from identifying the category of a numerical object, through transforming each category to text, to adjusting the transformation to the morphological needs of the Slovak language.

Keywords: text-to-speech systems, transformation of numerical objects, morphology analysis

1 INTRODUCTION

Text-to-speech systems are becoming widely used. They are invaluable tools, e.g. for blind or visually impaired people. However, these systems have to be tuned for every single language because of language-specific features. These features usually concern the speech generation process. However, a text may also contain objects that are not purely textual and that need specific handling when they are transformed to text. Numerical objects are one such category. Currently, the transformation of numbers to text is solved in many applications, mainly in the area of office processing and accounting. It is quite a straightforward process. However, these applications need only a simple transformation, usually for check printing.
If correct morphology is required, the transformation approach used in most current systems is completely insufficient for inflectional languages such as Slovak, and the process becomes more complicated. Texts usually contain many types of numerical objects (e.g. numbers, dates, times, sports match results etc.). Moreover, every category can be expressed in several forms.
Each of them needs a special approach; e.g. a number can be expressed in the text as an integer, a real number, a fraction etc. Moreover, numerical objects occur in different contexts, and in different contexts they should be treated differently. Our goal is to determine the correct morphological information for numerical objects according to their context and to transform these objects into the appropriate textual form. The whole transformation process can be divided into three phases:
- analysis of text: identification of the type of the numerical object;
- morphological analysis: identification and analysis of the words influencing the transformation (the words immediately in front of and behind the transformed numerical object);
- transformation itself: generation of the appropriate textual form of the numerical object using the information gained in the previous steps.
Below we describe the foundation of the transformation process.

2 TYPE OF NUMERICAL OBJECTS IN THE TEXTS

To determine the types of numerical objects, we analysed the numerical objects in our local corpus of Slovak text (mainly journal and newspaper texts available on the Internet). We identified numerical objects as numbers expressing:
- basic form of a cardinal number (nominative plural, except for the number 1);
- cardinal number with a value higher than 999, containing a space (190 000);
- declined basic form (s 20 domami);
- number connected to the next word by a hyphen (26-ročná, 63-minútová);
- time expression (19:00, 1:29:30,425);
- ordinal number in declined form (3. júna, 24. ročníka);
- interval (12-15 metrov);
- interval stated by ordinal numbers (v dňoch 1-10 októbra.);
- real number (0,6);
- percentage expression (38,23%);
- date in various forms (1.10.2006, 1. októbra 2006);
- special grouping of numbers and characters (zákon 625/2004, podujatie seriálu F1);
- Roman numerals (10. kolo II. ligy);
- sports match results (2:2).
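The categories listed above lend themselves to pattern matching. The following is a minimal sketch of how a single token could be assigned to a category with regular expressions. The patterns, the category names and the function `classify` are assumptions made for illustration; they are not the expressions used in the described system, and they cover only the isolated forms shown in the list (multi-word forms such as "1. októbra 2006" would need context).

```python
import re

# Ordered (category, pattern) pairs; the first pattern that matches the
# whole token wins. All patterns are illustrative assumptions.
PATTERNS = [
    ("date", re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}")),               # 1.10.2006
    ("time", re.compile(r"\d{1,2}:\d{2}(?::\d{2}(?:,\d+)?)?")),     # 19:00, 1:29:30,425
    ("score", re.compile(r"\d+:\d+")),                              # 2:2
    ("percentage", re.compile(r"\d+(?:,\d+)?\s?%")),                # 38,23%
    ("number_group", re.compile(r"\d+/\d+")),                       # 625/2004
    ("interval", re.compile(r"\d+-\d+")),                           # 12-15
    ("hyphenated", re.compile(r"\d+-[^\W\d_]+")),                   # 26-ročná
    ("ordinal", re.compile(r"\d+\.")),                              # 3.
    ("real", re.compile(r"\d+,\d+")),                               # 0,6
    ("cardinal_spaced", re.compile(r"\d{1,3}(?: \d{3})+")),         # 190 000
    ("cardinal", re.compile(r"\d+")),                               # 20
    ("roman", re.compile(r"[IVXLCDM]+\.")),                         # II.
]


def classify(token):
    """Return the first category whose pattern matches the whole token, else None."""
    for category, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return category
    return None


if __name__ == "__main__":
    for sample in ["190 000", "19:00", "2:2", "38,23%", "26-ročná", "II."]:
        print(sample, "->", classify(sample))
```

The order of the patterns matters: the time pattern requires two-digit minutes, so "2:2" falls through to the score pattern, but a result such as "2:10" would be read as a time. Resolving such cases needs the surrounding words, which is one reason the context analysis described in the next section is required.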
Statistics on the identified types of numerical objects are presented in Table 1. We determine the type of a numerical object by regular expressions.

3 MORPHOLOGICAL ANALYSIS

Morphological analysis consists of two steps:
- analysis of the word in front of the numerical object (FRONT WORD);
- analysis of the word after the numerical object (BEHIND WORD).

According to our research, the FRONT WORD influences the morphology of the numerical object if it is a preposition. Prepositions are tied to particular grammatical cases [1]. Using regular expressions, we determine all possible grammatical cases for the preposition standing in front of the numerical object.

Table 1 Statistics of numerical objects

Type of object                   Journals             Newspapers
                                 quantity       %     quantity       %
all                                 71620     100       317765     100
cardinal number                     39280   54,85       115322   36,29
in brackets                          1866    2,61        12404     3,9
one bracket only                      542    0,76         6018    1,89
date                                   90    0,12         8389    2,64
time                                  210    0,29        57685   18,15
sports match results                   81    0,11        16283    5,12
number word                          2780    5,28         8378    2,64
real number                          2659    3,71        19177    6,03
interval                              466    0,65         1141    0,36
ordinal numbers                     12071   16,85        56280   17,71
fractions                            5905    8,24         3612    1,14
non-continuous (e.g. 12 000)         2758    3,85         5778    1,82
object area or content                136    0,19          114   0,036
others                               1776    2,48         7184    2,26

Morphological analysis of the BEHIND WORD is implemented in two alternative ways. The first one uses the Database of Slovak Language words [2]. It is a fully inflectional database with the appropriate information about morphological categories. It contains about 80 000 words in the basic form and about 1 700 000 words in total (basic and inflected forms). The database was developed at our department. The second way of morphological analysis utilizes morphological resources available on the Internet. It uses the MORFEO search engine [3] (similar to the Google search engine, but for Slovak pages only and with support for inflection) and the Electronic Lexicon of Slovak Language [4] (or, alternatively, the Short Dictionary of Slovak Language [5]). These resources are used if the word is not found in the local Database of Slovak Language.
The MORFEO search engine is used to determine the basic form of the word; subsequently, the Electronic Lexicon of Slovak Language (or the Short Dictionary of Slovak Language) is used to extract the morphological data. The collected data (gender, paradigm and case) are used in the next step of the transformation.
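The two analysis steps described in this section can be sketched as follows. This is only an illustration under stated assumptions: the preposition table is a small hand-written subset (the described system derives the cases with regular expressions, whereas the sketch uses a plain dictionary lookup), and `Morph`, `candidate_cases`, `analyse_behind_word` and the two lookup callables are hypothetical names that stand in for the local database and the Internet resources.

```python
from typing import Callable, NamedTuple, Optional, Set

# Illustrative subset of Slovak prepositions and the cases they can govern.
# Assumption: a complete mapping would be taken from a grammar such as [1].
PREPOSITION_CASES = {
    "bez": {"genitive"},
    "do": {"genitive"},
    "od": {"genitive"},
    "z": {"genitive"},
    "zo": {"genitive"},
    "k": {"dative"},
    "ku": {"dative"},
    "s": {"instrumental"},
    "so": {"instrumental"},
    "na": {"accusative", "locative"},
    "o": {"accusative", "locative"},
    "v": {"accusative", "locative"},
    "vo": {"accusative", "locative"},
}


class Morph(NamedTuple):
    """Morphological data collected for the BEHIND WORD."""
    gender: str
    paradigm: str
    case: str


# A lookup takes a word form and returns its data, or None if unknown.
Lookup = Callable[[str], Optional[Morph]]


def candidate_cases(front_word: str) -> Set[str]:
    """Cases allowed by the FRONT WORD; empty if it is not a known preposition."""
    return set(PREPOSITION_CASES.get(front_word.lower(), set()))


def analyse_behind_word(word: str, local_db: Lookup, online: Lookup) -> Optional[Morph]:
    """Try the local database first and fall back to the online resources."""
    key = word.lower()
    result = local_db(key)
    if result is None:
        result = online(key)
    return result


if __name__ == "__main__":
    # Hand-written entry for illustration only (hypothetical database content).
    demo_db = {"domami": Morph("masculine", "dub", "instrumental")}
    no_online = lambda w: None  # stub: no network access in this sketch
    print(candidate_cases("s"))
    print(analyse_behind_word("domami", demo_db.get, no_online))
```

For the phrase "s 20 domami" from Section 2, both steps point to the instrumental case, which is the information the transformation step needs to produce a declined numeral.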
4 TRANSFORMATION TO TEXT

Generation of the appropriate textual form is the last step in the transformation process. The input data are the numerical object itself, its type, and the grammatical gender and case. We provide the following transformation methods (according to the determined type):
1. cardinal numeral in the basic form (masculine gender, nominative case) if we are not able to determine the gender and case;
2. cardinal number in declined form if we are able to determine the gender and case;
3. ordinal number, declined if the gender and case are determined, otherwise in the basic form;
4. math and fraction expressions;
5. date;
6. real number.
Each type of object is transformed to text according to the common rules for expressing numbers in the Slovak language.

5 RESULTS

In the following table we provide examples (edited to shorten the text) of the transformation of numerical objects to text (Table 2).

Table 2 Examples of transformation of numerical objects to texts

Authentic text:   V lokalite sa ráta s výstavbou 200 rodinných domov.
Transformed text: V lokalite sa ráta s výstavbou dvesto rodinných domov.

Authentic text:   ... spôsobiac tak škodu 170 000 dolárov.
Transformed text: ... spôsobiac tak škodu stosedemdesiattisíc dolárov.

Authentic text:   Spomína si 26-ročná Košičanka.
Transformed text: Spomína si dvadsaťšesť -ročná Košičanka.

Authentic text:   ... ktorému sa pred vypredaným (5 500 divákov) a fantastickým publikom...
Transformed text: ... ktorému sa pred vypredaným ( päťtisícpäťsto divákov) a fantastickým publikom...

Authentic text:   Cesta vedúca prvou uličkou s 20 domami bude stáť...
Transformed text: Cesta vedúca prvou uličkou s dvadsiatimi domami bude stáť...

Authentic text:   ... čo bolo nahlásené polícii v piatok o 23.24 hod.
Transformed text: ... čo bolo nahlásené polícii v piatok o dvadsiatej tretej hod. dvadsiatej štvrtej min.

Authentic text:   ... zvíťazila 63-minútová snímka z roku 2004 v prevahe...
Transformed text: ... zvíťazila šesťdesiattri - minútová snímka z roku dvetisícštyri v prevahe...

Authentic text:   ... odštartovali netradične, 3. júna v Londýne...
Transformed text: ... odštartovali netradične, tretieho júna v Londýne...

Authentic text:   Jej 27. ročníka sa zúčastnilo 1 050 ľudí, čo...
Transformed text: Jej dvadsiateho siedmeho ročníka sa zúčastnilo tisícpäťdesiat ľudí, čo...

Authentic text:   ... muža nezistenej totožnosti vo veku okolo 60-70 rokov.
Transformed text: ... muža nezistenej totožnosti vo veku okolo šesťdesiat až sedemdesiat rokov.

Authentic text:   ... klesla reálna mzda o 0,6 percenta, zadlženosť zo 170 miliárd korún v roku 1998 stúpla na dnešných 600 miliárd, a Slovensko...
Transformed text: ... klesla reálna mzda o nula celých šesť desatín percenta, zadlženosť zo stosedemdesiatich miliárd korún v roku tisícdeväťstodeväťdesiatosem stúpla na dnešných šesťsto miliárd, a Slovensko...

Authentic text:   ... informovať o nových cenách platných od 1.10.2005 a o novej výške...
Transformed text: ... informovať o nových cenách platných od prvého októbra dvetisícpäť a o novej výške...

Authentic text:   ... nový zákon o energetike č. 656/2004, ktorý...
Transformed text: ... nový zákon o energetike č. šesťstopäťdesiatšesť lomené dvetisícštyri, ktorý...

Authentic text:   Víťazom 17. podujatia tohtoročného seriálu majstrovstiev sveta F1, sa stal...
Transformed text: Víťazom sedemnásteho podujatia tohtoročného seriálu majstrovstiev sveta F jeden, sa stal...

Authentic text:   Zvolen v 47. min. ťažil z presilovky - na 2:2 vyrovnal Majeský.
Transformed text: Zvolen v štyridsiatej siedmej min. ťažil z presilovky - na dva ku dvom vyrovnal Majeský.

Authentic text:   Tento brejk potom s prehľadom potvrdil (6:3).
Transformed text: Tento brejk potom s prehľadom potvrdil ( šesť ku trom ).

6 CONCLUSION AND FUTURE WORK

As a result of our work, we designed a system that is able to rewrite most objects containing numerals into plain text according to the morphological rules of the Slovak language. The first version of the system was implemented as a diploma thesis [6]. In the near future we plan to test the system extensively and to develop it further.
The main goal is to provide a public interface to the system, either as a web interface or as a system based on a web services architecture.

REFERENCES

[1] DVONČ, L., HORÁK, G., RUŽIČKA, J. a i.: Morfológia slovenského jazyka. 1. vydanie. Bratislava: Vydavateľstvo SAV, 1966. 896 s.
[2] ŠPIRENG, P.: Databáza slov slovenského jazyka Agnitio 3. Semestrálny projekt KPI FEI TU Košice, 2005.
[3] http://www.morfeo.sk
[4] Electronic Lexicon of Slovak Language. http://www.slex.sk
[5] Short Dictionary of Slovak Language. http://kssj.juls.savba.sk
[6] ŠOLTÉSOVÁ, A.: Transkripcia číselných výrazov na text so zohľadnením morfologickej informácie. Diploma thesis. KPI FEI Technical University of Košice, 2006.