TRANSCRIPTION OF NUMERICAL OBJECTS TO TEXT FOR SLOVAK LANGUAGE

Journal of Information, Control and Management Systems, Vol. 5, (2007), No. 1

Ján GENČI
Technical University of Košice, Faculty of Electrical Engineering and Informatics, Slovak Republic
e-mail: genci@tuke.sk

Abstract

Text-to-speech systems are becoming widely used. However, these systems have to be tuned for every single language because of language-specific features. Usually these features concern the generation of speech from text. However, the text can also contain objects which need specific handling before speech synthesis. One category of such objects is numerical objects. This paper presents the handling of numerical objects and their transformation to text suitable for a text-to-speech system developed for the Slovak language. We present the process starting from identification of the category of a numerical object, through the transformation of each category to text, to the adjustment of the transformation according to the morphological needs of the Slovak language.

Keywords: text-to-speech systems, transformation of numerical objects, morphology analysis

1 INTRODUCTION

Text-to-speech systems are becoming widely used. They are invaluable tools, e.g. for blind or visually impaired people. However, these systems have to be tuned for every single language because of language-specific features. These features usually lie in the speech generation process. However, the text can also contain objects that are not purely textual and that need specific handling when they are transformed to text. One category of such objects is numerical objects.

Currently, the transformation of numbers to text is solved in many applications, mainly in the area of office processing and accounting. It is a quite straightforward process. However, these applications need only a simple transformation, usually for check printing. If we require a transformation that is correct with regard to morphology, the approach used in most current systems is completely insufficient for inflectional languages like Slovak, and the process becomes more complicated.

Texts usually contain many types of numerical objects (e.g. numbers, dates, times, sports match results etc.). Moreover, each category can be expressed in several forms.

Each of them needs a special approach; e.g. a number can be expressed in the text as an integer, a real number, a fraction etc. Moreover, such numerical objects occur in different contexts, and in different contexts they should be treated differently. Our goal is to determine the correct morphological information for numerical objects according to their context and to transform these objects into the appropriate textual form.

The whole process of transformation can be divided into three phases:
- analysis of text: identification of the type of numerical object;
- morphological analysis: identification and analysis of the words influencing the transformation (the words immediately in front of and behind the transformed numerical object);
- transformation itself: generation of the appropriate textual form of the numerical object using the information gained in the previous steps.

Below we describe the foundations of the transformation process.

2 TYPE OF NUMERICAL OBJECTS IN THE TEXTS

To determine the types of numerical objects, we analysed the numerical objects in our local corpus of Slovak texts (mainly journal and newspaper texts available on the Internet). We identified numerical objects expressing:
- basic form of cardinal number (nominative of plural, except number 1);
- cardinal number with a value higher than 999, containing a space (190 000);
- declined basic form (s 20 domami);
- connection to the next word through a hyphen (26-ročná, 63-minútová);
- time expression (19:00, 1:29:30,425);
- ordinal number in declined form (3. júna, 24. ročníka);
- interval (12-15 metrov);
- interval stated by ordinal numbers (v dňoch 1-10 októbra.);
- real number (0,6);
- percentage expression (38,23%);
- date in various forms (1.10.2006, 1. októbra 2006);
- special grouping of numbers and characters (zákon 625/2004, podujatie seriálu F1);
- Roman numerals (10. kolo II. ligy);
- sports match results (2:2).
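Since the object type is determined by regular expressions, the categories listed above could be recognised roughly as follows. This is a minimal Python sketch, not the authors' implementation: the labels, the patterns and their ordering are assumptions chosen only to cover the examples in the list.

```python
import re

# Ordered (label, pattern) pairs; the first pattern that matches the whole
# token wins. Labels and patterns are illustrative assumptions.
RULES = [
    ("date",        r"\d{1,2}\.\s?\d{1,2}\.\s?\d{4}"),       # 1.10.2006
    ("time",        r"\d{1,2}:\d{2}(?::\d{2}(?:,\d+)?)?"),   # 19:00, 1:29:30,425
    ("score",       r"\d{1,2}:\d{1,2}"),                     # 2:2
    ("percentage",  r"\d+(?:,\d+)?\s?%"),                    # 38,23%
    ("real",        r"\d+,\d+"),                             # 0,6
    ("interval",    r"\d+-\d+"),                             # 12-15
    ("slash_group", r"\d+/\d+"),                             # 625/2004
    ("compound",    r"\d+-[^\W\d_]+"),                       # 26-ročná
    ("grouped",     r"\d{1,3}(?: \d{3})+"),                  # 190 000
    ("roman",       r"[IVXLCDM]+\.?"),                       # II.
    ("ordinal",     r"\d+\."),                               # 3.
    ("cardinal",    r"\d+"),                                 # 20
]
COMPILED = [(label, re.compile(pattern)) for label, pattern in RULES]


def classify(token):
    """Return the label of the first rule that matches the whole token."""
    for label, pattern in COMPILED:
        if pattern.fullmatch(token):
            return label
    return "other"


for example in ["1.10.2006", "19:00", "2:2", "38,23%", "190 000", "26-ročná", "3."]:
    print(example, "->", classify(example))
```

The order of the rules matters: a score such as 2:2 is only reached because the time pattern demands two-digit minutes, and a result like 10:12 would still be labelled as a time. Resolving such cases needs the surrounding context, which this sketch does not inspect.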
Statistics on the identified types of numerical objects are presented in Table 1. We determine the type of a numerical object by regular expressions.

Table 1 Statistics of numerical objects

Type of object                  Journals              Newspapers
                                quantity   %          quantity   %
all                             71620      100        317765     100
cardinal number                 39280      54,85      115322     36,29
in brackets                     1866       2,61       12404      3,9
one bracket only                542        0,76       6018       1,89
date                            90         0,12       8389       2,64
time                            210        0,29       57685      18,15
sports match results            81         0,11       16283      5,12
number word                     2780       5,28       8378       2,64
real number                     2659       3,71       19177      6,03
interval                        466        0,65       1141       0,36
ordinal numbers                 12071      16,85      56280      17,71
fractions                       5905       8,24       3612       1,14
non-continuous (e.g. 12 000)    2758       3,85       5778       1,82
object area or content          136        0,19       114        0,036
others                          1776       2,48       7184       2,26

3 MORPHOLOGICAL ANALYSIS

Morphological analysis consists of two steps:
- analysis of the word in front of the numerical object (FRONT WORD);
- analysis of the word after the numerical object (BEHIND WORD).

According to our research, the FRONT WORD influences the morphology of the numerical object if it is a preposition. Prepositions are tied to particular grammatical cases [1]. Using regular expressions, we determine all possible grammatical cases for the preposition standing in front of the numerical object.

Morphological analysis of the BEHIND WORD is implemented in two alternative ways. The first one uses the Database of Slovak Language Words [2]. This is a full inflectional database with the appropriate information about morphological categories. It contains about 80 000 words in basic form and about 1 700 000 word forms (basic and inflected). The database was developed at our department.

The second way of morphological analysis utilizes morphological resources available on the Internet. It uses the MORFEO search engine [3] (similar to the Google search engine, but restricted to Slovak pages and with support for inflection) and the Electronic Lexicon of Slovak Language [4] (or, alternatively, the Short Dictionary of Slovak Language [5]). These resources are used if the word is not found in the local Database of Slovak Language Words. The MORFEO search engine is used to determine the basic form of the word; subsequently, the Electronic Lexicon of Slovak Language (or the Short Dictionary of Slovak Language) is used to extract the morphological data. The collected data (gender, paradigm and case) are used in the next step of the transformation.
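One way to combine the two analyses is to intersect the cases permitted by the FRONT WORD with the cases that the BEHIND WORD form can express. The following Python sketch is illustrative only and is not taken from the described system: the preposition table is a small hand-written subset, the function name is hypothetical, and `behind_word_cases` stands in for the result of a lookup in the morphological database, which is not reproduced here.

```python
import re

ALL_CASES = {"nominative", "genitive", "dative",
             "accusative", "locative", "instrumental"}

# Assumed, partial table of Slovak prepositions and the cases they govern.
PREPOSITION_CASES = {
    "s": {"instrumental"}, "so": {"instrumental"},
    "od": {"genitive"}, "do": {"genitive"},
    "z": {"genitive"}, "zo": {"genitive"},
    "k": {"dative"}, "ku": {"dative"},
    "o": {"accusative", "locative"},
    "v": {"accusative", "locative"}, "vo": {"accusative", "locative"},
    "na": {"accusative", "locative"},
    "pred": {"accusative", "instrumental"},
}

# The word immediately in front of the first digit in the text.
FRONT_WORD = re.compile(r"(\w+)\s+\d")


def candidate_cases(text, behind_word_cases):
    """Cases compatible with both the FRONT WORD and the BEHIND WORD."""
    match = FRONT_WORD.search(text)
    front = match.group(1).lower() if match else ""
    # A front word that is not a known preposition imposes no restriction.
    allowed = PREPOSITION_CASES.get(front, ALL_CASES)
    return allowed & set(behind_word_cases)


# "domami" can only be instrumental plural; "s" governs the instrumental.
print(candidate_cases("uličkou s 20 domami", {"instrumental"}))
```

A plain intersection is a simplification. In Slovak, numerals from päť (five) upward in the nominative or accusative are followed by a noun in the genitive plural (päť domov), so the intersection can come out empty for such phrases, and a fuller implementation would need a separate rule for that agreement pattern.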

4 TRANSFORMATION TO TEXT

Generation of the appropriate textual form is the last step in the transformation process. The input data are the numerical object itself, its type, and the grammatical gender and case. We provide the following transformation methods (according to the determined type):
1. cardinal numeral in basic form (masculine gender, nominative case), if we are not able to determine gender and case;
2. cardinal numeral in declined form, if we are able to determine gender and case;
3. ordinal numeral, declined if gender and case are determined, otherwise in basic form;
4. mathematical and fraction expressions;
5. date;
6. real number.
Each type of object is transformed to text according to the common rules for expressing numbers in the Slovak language.

5 RESULTS

The following table gives examples (edited to shorten the text) of the transformation of numerical objects to text (Table 2).

Table 2 Examples of transformation of numerical objects to text

Authentic text:   V lokalite sa ráta s výstavbou 200 rodinných domov.
Transformed text: V lokalite sa ráta s výstavbou dvesto rodinných domov.

Authentic text:   ... spôsobiac tak škodu 170 000 dolárov.
Transformed text: ... spôsobiac tak škodu stosedemdesiattisíc dolárov.

Authentic text:   Spomína si 26-ročná Košičanka.
Transformed text: Spomína si dvadsaťšesť -ročná Košičanka.

Authentic text:   ... ktorému sa pred vypredaným (5 500 divákov) a fantastickým publikom...
Transformed text: ... ktorému sa pred vypredaným ( päťtisícpäťsto divákov) a fantastickým publikom...

Authentic text:   Cesta vedúca prvou uličkou s 20 domami bude stáť...
Transformed text: Cesta vedúca prvou uličkou s dvadsiatimi domami bude stáť...

Authentic text:   ... čo bolo nahlásené polícii v piatok o 23.24 hod.
Transformed text: ... čo bolo nahlásené polícii v piatok o dvadsiatej tretej hod. dvadsiatej štvrtej min.

Authentic text:   ... zvíťazila 63-minútová snímka z roku 2004 v prevahe...
Transformed text: ... zvíťazila šesťdesiattri - minútová snímka z roku dvetisícštyri v prevahe...

Authentic text:   ... odštartovali netradične, 3. júna v Londýne...
Transformed text: odštartovali netradične, tretieho júna v Londýne...

Authentic text:   Jej 27. ročníka sa zúčastnilo 1 050 ľudí, čo...
Transformed text: Jej dvadsiateho siedmeho ročníka sa zúčastnilo tisícpäťdesiat ľudí, čo...

Authentic text:   ... muža nezistenej totožnosti vo veku okolo 60 - 70 rokov.
Transformed text: ... muža nezistenej totožnosti vo veku okolo šesťdesiat až sedemdesiat rokov.

Authentic text:   ... klesla reálna mzda o 0,6 percenta, zadlženosť zo 170 miliárd korún v roku 1998 stúpla na dnešných 600 miliárd, a Slovensko...
Transformed text: ... klesla reálna mzda o nula celých šesť desatín percenta, zadlženosť zo stosedemdesiatich miliárd korún v roku tisícdeväťstodeväťdesiatosem stúpla na dnešných šesťsto miliárd, a Slovensko...

Authentic text:   ... informovať o nových cenách platných od 1.10.2005 a o novej výške...
Transformed text: ... informovať o nových cenách platných od prvého októbra dvetisícpäť a o novej výške...

Authentic text:   ... nový zákon o energetike č. 656/2004, ktorý...
Transformed text: ... nový zákon o energetike č. šesťstopäťdesiatšesť lomené dvetisícštyri, ktorý...

Authentic text:   Víťazom 17. podujatia tohtoročného seriálu majstrovstiev sveta F1, sa stal...
Transformed text: Víťazom sedemnásteho podujatia tohtoročného seriálu majstrovstiev sveta F jeden, sa stal...

Authentic text:   Zvolen v 47. min. ťažil z presilovky - na 2:2 vyrovnal Majeský.
Transformed text: Zvolen v štyridsiatej siedmej min. ťažil z presilovky - na dva ku dvom vyrovnal Majeský.

Authentic text:   Tento brejk potom s prehľadom potvrdil (6:3).
Transformed text: Tento brejk potom s prehľadom potvrdil ( šesť ku trom ).

6 CONCLUSION AND FUTURE WORK

As a result of our work we designed a system which is able to rewrite most objects containing numerals into plain text according to the morphological rules of the Slovak language. The first version of the system was implemented as a diploma work [6]. In the near future we plan to test the system extensively and to develop it further.
The main goal is to provide a public interface to the system, either as a web interface or as a system based on a web services architecture.

REFERENCES

[1] DVONČ, L., HORÁK, G., RUŽIČKA, J. a i.: Morfológia slovenského jazyka. 1. vydanie. Bratislava: Vydavateľstvo SAV, 1966. 896 s.
[2] ŠPIRENG, P.: Databáza slov slovenského jazyka Agnitio 3. Semestrálny projekt KPI FEI TU Košice, 2005.
[3] MORFEO search engine. http://www.morfeo.sk
[4] Electronic Lexicon of Slovak Language. http://www.slex.sk
[5] Short Dictionary of Slovak Language. http://kssj.juls.savba.sk
[6] ŠOLTÉSOVÁ, A.: Transkripcia číselných výrazov na text so zohľadnením morfologickej informácie. Diploma work. KPI FEI Technical University of Košice, 2006.