CROSS LANGUAGE INFORMATION ACCESS IN TELUGU

Size: px

Start display at page:

Download "CROSS LANGUAGE INFORMATION ACCESS IN TELUGU"

Melvyn Newton
5 years ago
Views:

1 CROSS LANGUAGE INFORMATION ACCESS IN TELUGU by Vasudeva Varma, Aditya Mogadala Mogadala, V. Srikanth Reddy, Ram Bhupal Reddy in Siliconandhrconference (Global Internet forum for Telugu) Report No: IIIT/TR/2011/-1 Centre for Search and Information Extraction Lab International Institute of Information Technology Hyderabad , INDIA September 2011

2 CROSS LANGUAGE INFORMATION ACCESS IN TELUGU Vasudeva Varma*, Aditya Mogadala, Srikanth Reddy, Bhupal Reddy Search and Information Extraction Lab IIIT- Hyderabad, India Abstract This paper describes a large scale system for Cross Language Information Access (CLIA) which will help accessing information available in a language that is different from the language in which a query or information need is expressed. We specifically focus on Telugu to English and Hindi CLIA system. For a query given in Telugu, this system retrieves information available in English and Hindi. It also tries to present that information back in Telugu using several information access technologies such as: Information Extraction, Summarization and Machine Translation. This system was developed at IIIT Hyderabad, in the context of a larger project funded by Ministry of Communication and Information Technology executed by a consortium of ten Universities and research organizations covering six Indian languages for input and English and Hindi as target languages. CLIA systems can be considered as extension to Cross Language Information Retrieval (CLIR) systems. They process of the results obtained by the CLIR system in the target language. Users unfamiliar with the language of documents returned using CLIR are often unable to extract relevant information from these documents. This requires further processing which might include producing a summary of the multiple documents retrieved, translating such summary back to the source or the query language, extracting structured information from the retrieved documents and then producing human consumable information nuggets, and translating the entire or the relevant portions of the document. In order to address above- mentioned needs for building efficient CLIA systems one requires fully automatic high quality translation system at different levels of information processing. This paper details the pre- retrieval and post- retrieval processing tasks where various challenges like query translation, language identification, software engineering issues involved in building end to end systems. I. INTRODUCTION Information Access is the process of making the information available in various documents accessible and usable to the user who have a specific information need. The documents may be of various media, formats, document sources or even languages. Information Retrieval (IR) technologies enable information access by retrieving a set of ranked documents that are likely to be relevant to the information need of the user. However, IR is only a part of the information access puzzle. Role of IR technologies ends once the relevant documents are obtained. After the results are obtained, the user needs to skim through the documents, judge the relevance of these documents, compare them against each other, find out relevant portions of the document that might satisfy their information need, extract elements of the text that provide answers, and perhaps summarize multiple documents or portions of the documents. All this requires processing of world knowledge. Information Access technologies are expected to provide these functionalities that is more cognitive in nature. CLIR can be seen also as a technology that combines both Information Retrieval and Machine Translation (MT) and extend to the languages other than English. The structure of CLIR system is broken down into categories like Indexing (IR), translation, ranking and matching. Oard & Dorr [1] work is perhaps the first attempt to study various approaches and techniques of CLIR in a detailed manner and showed that CLIR is not exactly the combination of IR and MT and somewhere between them. IR focuses on retrieving the relevant documents given a query by a human user and MT aims at producing single accurate target equivalent of a given input text. CLIR s functionality may be a hybrid of both these systems in the sense that the IR engine of CLIR system can take multiple translations produced by a program (as opposed to human user s single query) and the translation system tries to process not so complete input 1

3 text (for example, just named entities, multiword expressions or simple sentence fragments) and produce multiple target language equivalents focusing less on producing grammatically correct translations. CLIA further improves this information and provide more meaningful representation of the information. In this paper we first describe the overall architecture of our CLIA system in Section 2. We see some preprocessing aspects like language identification during crawling in Section 3. Input processing aspects like query translation, transliteration and disambiguation issues will be mentioned in Section 4.Then we will see some of the challenges and building aspects of the end to end Telugu CLIA system in Section 5. We explain the experimental results obtained in each of the modules mentioned. Conclusion and Future Work will be mentioned in Section 7. II. ARCHITECTURE OF THE CLIA SYSTEM The functionality of the CLIA system can be broken down into two main phases A) Offline Processing and B) Online Processing. A. Offline Processing The offline processing phase encapsulates the following tasks crawling, parsing and indexing. 1) Crawling This task simply refers to the process of crawling the web and fetching all the documents found for further processing. The crawl starts by taking as input a set of starting URLs called as Seed URLs. Once these URLs are fetched, the out links from these documents are added to the queue and the crawler fetches them too. This process is repeated until the queue is empty or a present limit is reached. 2) Parsing The parsing phase is an important part of the offline phase. All processing of the fetched documents is done in this phase. The CLIA system does the following tasks during this phase not necessarily in the specified order. Font Transcoding: Conversion of web pages with non- standard or proprietary character or font encoding into standard UTF- 8. Language Identification: Identify the writing language of the document. Any language specific offline processing is done based on the language attribute of the document. Domain Identification: Categorize the webpage into one of the pre- decided domains. Current system supports only Tourism. This information is used for domain specific processing like deciding upon the IE template to be filled. Named Entity and Multi- word Expression Extraction: NEs and MWEs in the document are marked so as to facilitate other modules to utilize this information. Summary Generation: An extractive summary of the document is generated to be presented to the user in the search interface. Information Extraction: Domain specific IE templates are populated in this module for providing information to the user in a tabulated form. 3) Indexing 2

4 In the indexing phase, the various fields of a document are indexed and stored as an inverted index readily searchable. The important fields that are included in the index are the document title, content, language, domain and named entities. Document content is analyzed which includes stop word removal, stemming etc before indexing it. The most important modules in the offline processing phase are the Language Identification module and the Domain Identification module. These are discussed in detail in sections III and IV. B. Online Processing Online processing refers to all the tasks that are done from the time the user has fired a search query to the time the user sees the results page. A brief note on all the online processing tasks of CLIA is provided below. Named Entity and Multi- Word Expression extraction: A light weight NER and MWE analyzer is used to extract the named entities and multi word expressions so that they will be directly searched in the index without any further analysis like stemming. Query Processing: For a monolingual IR system this includes query reformulation to improve recall. For a cross lingual IR system this includes conversion of query from one language into another. Retrieval and Ranking: Retrieve relevant documents from the index and rank them as per their importance. Snippet generation: Generating a small excerpt from the document highlighting the query context so that the user can take a decision whether to visit the webpage or not. Snippet translation: This is an important feature of the CLIA system wherein the snippet is translated into the user s language for easier understanding. Query focused summary generation: Generating summary of the chosen document focusing on the query context. The entire functionality and performance of the CLIA system is dependant on the Query Processing module. This is discussed in detail in Section V. III. LANGUAGE IDENTIFICATION Language identification is a task of identifying the language of a page to which it belongs. There can be multiple languages present in a page, but categorizing it to the most dominating language present in the page is important. In the CLIA systems to index the pages of different languages initially during the crawling phase, language identification plays a major role. When a page is downloading during crawling, we initially identify the encoding of the font. If the font is proprietary then font is converted to UTF- 8 before doing the language identification. The language identifier categorizes the following Indian languages mainly Hindi, Tamil, Bengali, Telugu, Marathi and Punjabi in a document. For these it takes external data in the form of n- grams and dictionary of words in each language as input to train the n- gram 3

5 profiling model. When the UTF- 8 encoded page content is given to the language identifier it returns the Language ID for each of these Indian languages Hindi, Bengali, Tamil, Marathi, Punjabi and Telugu. IV. INPUT PROCESSING Input Processing involves post offline processing tasks that handle user inputs and feedback. Listed below are some of the non- trivial tasks in building efficient CLIA system. Query processing, which includes translation and transliteration, pre- translation query expansion and post translation query refinement and expansion, named entity and multiword expression recognition in query, word sense disambiguation are some of the tasks involved. Below we will discuss about query translation, transliteration and disambiguation aspects. A. Query Translation Query words which are other than named entities (NE) and multi-word expressions (MWE) needs to be translated from Indian language to the cross language i.e. Hindi and English to do further processing on the query. In order to translate them to cross language there is need of bilingual dictionaries in each of the Indian languages to English and Hindi. There are bilingual dictionaries stored for each of the Indian languages mentioned earlier consisting at least 20k parallel words for translation. When the query is fired in an Indian language the words other than NE s and MWE s will be searched through the bilingual dictionaries for exact translation. Then the query is reformed in the cross language before the query is fired in the search index. B. Query Transliteration Transliteration, the pronunciation- based translation or keyword- based translation from a source language to a desired target language, is important to many cross language and natural languages processing tasks. Also success of any CLIA system depends on efficiency of its transliteration of queries. When not carefully handled, the mean average precision (MAP) can reach 50% [2].Generally these queries may contain MWE s, NE s or general language terms. The challenge for the system is to identify the type of query terms and decide accordingly whether to translate or transliterate. Also, mining appropriate transliterations from the top results of the first- pass retrieval should be considered as it achieves enhanced cross lingual performance of the system overall, in addition to enhancing individual performance of more queries [3]. There are some drawbacks that need to be handled if transliteration does not produce the exact spelling variants used in the document collection. An approximate string matching technique is usually required to alleviate this drawback. By doing so, retrieval performance should not be compromised. If inaccurate or confusing transliterations occur in multiple alternative transliterations [4] appropriate action should be taken. Also, if translated or transliterated query is human readable it ensures proper transliteration and ultimately provides good search results. For transliterating NE s in the first step we took word- aligned bilingual corpus while for the second step statistics over the alignments to transliterate the source language word and generate the desired number of target language words is done. For this Hidden Markov Model 4

6 (HMM) alignment and Conditional Random Fields (CRFs), a discriminative model is used. HMM alignment maximizes the probability of the observed word pairs using the expectation maximization algorithm and then the character level alignments (n- gram) are set to maximum posterior predictions of the model while CRF uses forward Viterbi and backward A* search whose combination produce the exact n- best results. For HMM alignment GIZA++ is used while CRF++ for training the model. The list of named entity words identified that need to be transliterated is taken and using the trained model top 3 probable words is extracted. C. Query Disambiguation Basic research is needed to investigate the relationship between sense ambiguity, disambiguation, and information retrieval. Understanding the influence of ambiguity and disambiguation on a probabilistic IR system is required to improve the retrieval results. Query ambiguity is only problematic to an IR system when it is retrieving from very short queries [5]. Also word senses in a query has to resolve to high degree of accuracy. For CLIA, this creates much more complex problems as it involves understanding the query in multiple languages. To determine the sense of a word, a word sense disambiguation (WSD) algorithm typically should understand the context of the ambiguous word, external resources such as machine- readable dictionaries, or a combination of both. Although dictionaries provide useful word sense information and thesaurus provide additional information about relationships between words, they lack pragmatic information as unavailability of them in corpora. But there are major barriers in building a high- performing query disambiguation system as it includes difficulty of labeling data and predicting fine- grained sense distinctions. User queries can be diversified into different topics and areas and could mean differently in each of those areas. This creates problem of word disambiguation. Thus any system that attempts to retrieve good results has to determine the sense of a word from contextual features. V. END TO END TELUGU CLIA SYSTEM Our system focuses mainly on three languages: Telugu, Hindi and English. We analyze the performance of each language separately. In this section, we present an end to end separate CLIA system which is divided into different levels and modules. These levels consist of (a) Crawling, (b) Indexing and finally (C) Query processing, searching and ranking. We segregate each stage and explain the sub steps that were addressed at each level. A. Crawling A Crawler is a program, which browses the World Wide Web or the given document collection in a methodical automated manner and creates a copy of all the visited pages for later processing. In this CLIA system we have domain identifier [6] crawling system that looks for pages that fit a particular description. The pages belonging to certain selected topics, or satisfying certain criteria, are indexed and maintained. The crawler is guided by a classifier to recognize the relevance of the pages crawled. In this system we have specified domains, which crawls only Tourism and Health domains and also classifies the Indian languages. For this machine learning based SVM classifier is used. 5

7 B. Language Decoding As we are building system for Indian languages that have multiple encodings in each language needs to be addressed. Good number of language encodings is in usage on Internet and other document repositories. Especially in Indian language document repositories, these document collections are available in a set of national and international standards for Indian language character set encodings, while a number of publishers use proprietary non- standard encodings. For example Telugu language has some proprietary fonts like Hemalatah, Tikkana, Vaartha, Karthika etc.in order to be able to index such content, it is very important to identify the language and such encodings to achieve a better recall in retrieval by having much broader coverage. We use a set of heuristic based techniques using fonts and character ranges to recognize character encodings such as UTF- 8, ISCII or other proprietary encodings from HTML and PDF documents [7]. C. Indexing An indexer module should be capable of building data structures, which enable fast retrieval of relevant documents and also enable ranking of these documents. The design of data structure for index should also be sensitive to transaction aspects such as locking of indices for writes, updatability of index without difficulty etc. In a Cross- lingual IR task using a bilingual dictionary along with term co- occurrence statistics and language modeling approach [8] helps improve the functional IR performance. We used an augmented index model which is used for fast retrieval while having the benefits of language modeling in a CLIR task. This model is capable of retrieval and ranking with or without query expansion techniques using term collocation statistics of the indexed corpus [9]. D. Input Processing Query processing is important step for input processing in improving recall and precision of any CLIA system. Query pre- processing needs to be done to improve efficiency in retrieval. In this CLIA system query processing module uses n- gram based translation of query words using bi- lingual lexicons. Identification of named entities is done by Named Entity Recognizers. If translation is not possible transliteration of the identified named entities is been carried out. Also a Boolean query- scoring sub- module used to weigh Boolean queries to generate the output of the query- processing module. We briefly discuss some of the query processing steps performed in building this CLIA system for Telugu, Hindi and English. E. Query Processing List of stop words are prepared which is used to remove stop words from the query before translation or transliteration. Since Telugu is a highly agglutinative language and in some cases a whole phrase or a sentence from English could be translated into a single word in Telugu. For this we used some shallow rules [10] that can remove some common prefixes and suffixes. Some of the very common prefixes and suffixes are removed from a given Telugu 6

8 word. A given word is stemmed from these prefixes and suffixes recursively until a stem is obtained. Some of the other sub tasks in query pruning are listed below. Named Entities identification Named Entity Recognition is the task of identifying and classifying all proper nouns in a document as person names, organization names, location names, date & time expressions and miscellaneous. Conditional Random Fields (CRFs) [11] with two Indian languages Telugu and Hindi [12, 13] is used with the character n- gram based models for this task. Query translation using lexicons Resources containing English- Hindi and English- Telugu cross language dictionaries were primarily used in English to Indian language query translation. Available English- Hindi dictionary was conveniently formatted for machine processing; however the English- Telugu dictionary was a digitized version of a human readable dictionary. In order to convert the human readable dictionary to machine processing form, a set of regular expressions were used [7]. Transliteration Transliteration task is considered to be non- trivial as its output decides the quality of the results retrieved in CLIA systems. The basic principle followed was to compress the source word into its minimal form and align it across an indexed list of target language words to arrive at the top n- equivalents based on the modified edit distance. This Mapping model was used instead of alignment- based model as it ensures fastness. To speed up the process of arriving at the right match in the target index, Compressed Word Format (CWFs) of the target named entities and indexing the CWFs along with their actual forms is done. When any new query word arrives for transliteration CWF form is to be generated and is to be compared with the CWF of the source query word with the CWFs of the target language. Entries in the index of the matching entries with modified Levenshtein distance equal to zero will be returned. Query Scoring Query scoring helps to rank the retrieved results. Once the source language queries were translated and transliterated, the resultant target language keywords is to be used for constructing Boolean queries using the OR operator. If a particular keyword occurs in multiple sections of the query, it has to be given greater score compared to the other keywords accordingly. Hence, cumulative weight for each word to be calculated based on the number of occurrences [14]. Other important keywords like years, numbers, are also given higher weight factors. The translated query word in the Boolean query and the scoring weight are associated and the final query output for a given source language query from the system would be of the formed and used accordingly. F. Snippet Generation Snippet is a short summary of each document retrieved by the CLIA system for a user query. To retrieve the content of the retrieved document from search engine query words are given to the CLIA system. 7

9 In our system the query words will be highlighted in the snippet produced and then forwarded to the snippet Translation module. Extracted snippet is considered efficient if the sentence extraction has language specific sentence boundary delimiter task. For this title extracted using the title tags obtained from the HTML source file or from the document files is checked for accuracy. Each complete sentence will be checked for the presence of the query words in question. Sentences are ranked based on the number of query words, key words and domain ontology words present in the sentence. The generated short summary or snippet contains the context (five to ten words depending on the number of query words) of the highlighted query words subject to a predetermined length of the snippet. If no complete sentences are identified, the menu list on the web page will be listed as the snippet of the web page. G. Retrieval and Ranking Probabilistic retrieval algorithm BM25 [15] is used for cross lingual and monolingual information retrieval. It is a ranking function used to rank matching documents according to their relevance to a given search query. It uses bagof-words approach for retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document. It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. VI. EXPERIMENTAL RESULTS The evaluation of language identifier, Input processing and end- to- end CLIA system mentioned in earlier sections consists of analyzing and evaluating quantitatively to find the performance and issues in various components of the system. The evaluation framework will also dependent on the construction of benchmark data for the Telugu and Hindi languages in accordance with the requirements. Evaluation here depends on two major evaluations one is on methods and metrics and other is on creating benchmark data. Two of the most widely used measures of document retrieval effectiveness are Recall and Precision. Recall measures how well a system retrieves all the relevant documents; and Precision, how well the system retrieves only the relevant documents [16]. Precision scores have been evaluated for Language Identifier for each of the languages Hindi, Bengali, Marathi, Telugu, Punjabi and Tamil. Table 1 describes the module accuracy in identifying the languages. TABLE I. LANGUAGE IDENTIFIER ACCURACY Language Accuracy Telugu 98% Hindi 99% Tamil 99% Punjabi 97% Marati 98% Bengali 99% Input processing have sub tasks involved which needs to be evaluated separately. We will see the results obtained in the input processing task in end to end Telugu CLIA system. 8

10 End to End Telugu CLIA system has many sub tasks involved we will see evaluations done on each of them separately. Crawling Crawling was done to index only tourism domain related websites. Table 2 shows the accuracy achieved by the crawler in identifying the tourism related websites for Hindi and Telugu. TABLE II. DOMAIN IDENTIFIER Language Accuracy Telugu 83.5% Hindi 66% NER identification in corpus The corpus in total consisted of tokens out of which were NE s. Corpus is divided into training and testing sets. The training set consisted of tokens out of which 8485 were NE s. The testing set consisted of tokens out of which 2407 were NE s. A word window 6 words at a time having parts of speech (POS), suffix and prefix gave F- Measure 44.91% for Hindi. Another approach using n- grams has been performed which attained F- Measure of 44.57% for Hindi if n=4 and 49.62% for Telugu if n=3. Translation of Query Telugu to Hindi and Telugu to English Bilingual dictionaries has been developed which has 46k entries. Transliteration of Query Transliteration methods mentioned in the earlier sections using HMM and CRF models produced overall 87% precision when generated models are of Telugu to Hindi alignments and 83% for Telugu to English. Romanization method also tested which produced lower accuracies compared to HMM and CRF model. Overall system was evaluated using the P@5 and P@10 scores for 5 queries.table 3 below shows the scores for each of the queries. TABLE III. P@5 AND P@10 FOR 5 QUERIES FROM TELUGU-HINDI Query No. P@5 P@

11 TABLE IV. P@5 AND P@10 FOR 5 QUERIES FROM TELUGU-ENGLISH Query No. P@5 P@ VII. CONCLUSION AND FUTURE WORK We have described the CLIA systems pre processing and post processing tasks along with language identification and other specific tasks for building CLIA Telugu system. We understood that if the system handles information between pair of languages then they are called cross language systems. All cross language information access systems essentially benefit from the Information Retrieval technologies, which enable information access by retrieving a set of ranked documents that are likely to be relevant to the information. However, the role of IR technologies ends once the relevant documents are obtained. But there are other problems when dealing with CLIA systems. They require query translation/translation, NE identification, results ranking, which have dependencies on lexicons having good coverage, bilingual dictionaries, and parallel corpora for efficient information retrieval. In this paper, we have presented major challenges in building those CLIA systems and also we described end to end Telugu CLIA system. Also we will continuously working on the improvement on the existing system using various state of art approaches in input and output processing stages which constitutes our future work. ACKNOWLEDGMENT We acknowledge the funding we have received from Ministry of Communication and Information Technology (MCIT), Government of India and Cross Language Information Access (CLIA) consortium members helped us. REFERENCES [1] Oard, D., & Dorr, B. (1996). A Survey of Multilingual Text Retrieval. Technical Report UMIACS-TR-96-19, University of Maryland, Institute for Advanced Computer Studies. [2] Larkey, L., AbdulJaleel, N., & Connell, M. (2003).What s in a Name? : Proper Names in Arabic Cross Language Information Retrieval. CIIR Technical Report, IR-278, Univ. of Amherst. [3] Saravanan, K., Udupa, R., & Kumaran, A. (2010). Cross lingual Information Retrieval System Enhanced with Transliteration Generation and Mining. In Proceedings of Forum for Information Retrieval Evaluation (FIRE-2010) Workshop, Kolkata, India. 10

12 [4] Ea-Ee Jan, Shih-Hsiang Lin, & Berlin Chen. (2010). Transliteration Retrieval Model for Cross Lingual Information Retrieval. Lecture Notes in Computer Science, 2010, Volume 6458, Information Retrieval Technology, pp [5] Sanderson, M. (1994).Word sense disambiguation and information retrieval. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, Dublin, Ireland, p [6] Chakrabarti, S., Dom, B., & van den Berg, M. (1999). Focused Crawling: A New Approach for Topic-Specific Resource Discovery. In Proc. 8th World Wide Web Conf., Elsevier Science, Amsterdam, pp [7] Pingali, P., & Varma, V. (2006).Hindi and Telugu to English Cross Language Information Retrieval.Working Notes of Cross Language Evaluation Forum Workshop, Spain. [8] Pingali P., Kula K., T., & Varma, V. (2007). Hindi, Telugu, Oromo, English CLIR Evaluation. Evaluation of Multilingual and Multi-modal Information Retrieval. Vol. 4730, 2007, ISBN , Springer-Verlag. [9] Pingali, P., & Varma, V. (2007). Multilingual Indexing Support for CLIR using Language Modeling. In Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. [10] Pingali, P., Jagarlamudi, J., & Varma, V. (2006). A Dictionary Based Approach with Query Expansion to Cross Language Query Based Multi-Document Summarization: Experiments in Telugu English. National Workshop on Artificial Intelligence, Mumbai, India. [11] Wallach H., M. (2004). Conditional random fields: An introduction. Technical Report MS-CIS-04-21, University of Pennsylvania, Department of Computer and Information Science, University of Pennsylvania. [12] Shishtla, P., M., Pingali, P., & Varma, V. (2008). A Character n-gram Based Approach for Improved Recall in Indian Language NER. In Proceedings of IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages, Hyderabad, India, pp [13] Shishtla, P., M., Pingali, P., & Varma, V. (2008). Experiments in Telugu NER: A Conditional Random Field Approach. In Proceedings of IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages, Hyderabad, India, pp [14] Ballesteros, L., & Croft W., B. (1997). Phrasal translation and query expansion techniques for cross-language information retrieval. In Proceedings of the 20th Annual International ACM Conference on Research and Development in Information Retrieval, Philadelphia, PA, July 27 31, pp [15] Robertson, S., & Zaragoza, H. (2009).Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval. pp [16] Saracevic, T. (1975). Relevance: A review of and a framework for thinking on the notion in information science. Journal of the American Society for Information Science and Technology, pp

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent