CROSS LANGUAGE INFORMATION ACCESS IN TELUGU

Size: px
Start display at page:

Download "CROSS LANGUAGE INFORMATION ACCESS IN TELUGU"

Transcription

1 CROSS LANGUAGE INFORMATION ACCESS IN TELUGU by Vasudeva Varma, Aditya Mogadala Mogadala, V. Srikanth Reddy, Ram Bhupal Reddy in Siliconandhrconference (Global Internet forum for Telugu) Report No: IIIT/TR/2011/-1 Centre for Search and Information Extraction Lab International Institute of Information Technology Hyderabad , INDIA September 2011

2 CROSS LANGUAGE INFORMATION ACCESS IN TELUGU Vasudeva Varma*, Aditya Mogadala, Srikanth Reddy, Bhupal Reddy Search and Information Extraction Lab IIIT- Hyderabad, India Abstract This paper describes a large scale system for Cross Language Information Access (CLIA) which will help accessing information available in a language that is different from the language in which a query or information need is expressed. We specifically focus on Telugu to English and Hindi CLIA system. For a query given in Telugu, this system retrieves information available in English and Hindi. It also tries to present that information back in Telugu using several information access technologies such as: Information Extraction, Summarization and Machine Translation. This system was developed at IIIT Hyderabad, in the context of a larger project funded by Ministry of Communication and Information Technology executed by a consortium of ten Universities and research organizations covering six Indian languages for input and English and Hindi as target languages. CLIA systems can be considered as extension to Cross Language Information Retrieval (CLIR) systems. They process of the results obtained by the CLIR system in the target language. Users unfamiliar with the language of documents returned using CLIR are often unable to extract relevant information from these documents. This requires further processing which might include producing a summary of the multiple documents retrieved, translating such summary back to the source or the query language, extracting structured information from the retrieved documents and then producing human consumable information nuggets, and translating the entire or the relevant portions of the document. In order to address above- mentioned needs for building efficient CLIA systems one requires fully automatic high quality translation system at different levels of information processing. This paper details the pre- retrieval and post- retrieval processing tasks where various challenges like query translation, language identification, software engineering issues involved in building end to end systems. I. INTRODUCTION Information Access is the process of making the information available in various documents accessible and usable to the user who have a specific information need. The documents may be of various media, formats, document sources or even languages. Information Retrieval (IR) technologies enable information access by retrieving a set of ranked documents that are likely to be relevant to the information need of the user. However, IR is only a part of the information access puzzle. Role of IR technologies ends once the relevant documents are obtained. After the results are obtained, the user needs to skim through the documents, judge the relevance of these documents, compare them against each other, find out relevant portions of the document that might satisfy their information need, extract elements of the text that provide answers, and perhaps summarize multiple documents or portions of the documents. All this requires processing of world knowledge. Information Access technologies are expected to provide these functionalities that is more cognitive in nature. CLIR can be seen also as a technology that combines both Information Retrieval and Machine Translation (MT) and extend to the languages other than English. The structure of CLIR system is broken down into categories like Indexing (IR), translation, ranking and matching. Oard & Dorr [1] work is perhaps the first attempt to study various approaches and techniques of CLIR in a detailed manner and showed that CLIR is not exactly the combination of IR and MT and somewhere between them. IR focuses on retrieving the relevant documents given a query by a human user and MT aims at producing single accurate target equivalent of a given input text. CLIR s functionality may be a hybrid of both these systems in the sense that the IR engine of CLIR system can take multiple translations produced by a program (as opposed to human user s single query) and the translation system tries to process not so complete input 1

3 text (for example, just named entities, multiword expressions or simple sentence fragments) and produce multiple target language equivalents focusing less on producing grammatically correct translations. CLIA further improves this information and provide more meaningful representation of the information. In this paper we first describe the overall architecture of our CLIA system in Section 2. We see some preprocessing aspects like language identification during crawling in Section 3. Input processing aspects like query translation, transliteration and disambiguation issues will be mentioned in Section 4.Then we will see some of the challenges and building aspects of the end to end Telugu CLIA system in Section 5. We explain the experimental results obtained in each of the modules mentioned. Conclusion and Future Work will be mentioned in Section 7. II. ARCHITECTURE OF THE CLIA SYSTEM The functionality of the CLIA system can be broken down into two main phases A) Offline Processing and B) Online Processing. A. Offline Processing The offline processing phase encapsulates the following tasks crawling, parsing and indexing. 1) Crawling This task simply refers to the process of crawling the web and fetching all the documents found for further processing. The crawl starts by taking as input a set of starting URLs called as Seed URLs. Once these URLs are fetched, the out links from these documents are added to the queue and the crawler fetches them too. This process is repeated until the queue is empty or a present limit is reached. 2) Parsing The parsing phase is an important part of the offline phase. All processing of the fetched documents is done in this phase. The CLIA system does the following tasks during this phase not necessarily in the specified order. Font Transcoding: Conversion of web pages with non- standard or proprietary character or font encoding into standard UTF- 8. Language Identification: Identify the writing language of the document. Any language specific offline processing is done based on the language attribute of the document. Domain Identification: Categorize the webpage into one of the pre- decided domains. Current system supports only Tourism. This information is used for domain specific processing like deciding upon the IE template to be filled. Named Entity and Multi- word Expression Extraction: NEs and MWEs in the document are marked so as to facilitate other modules to utilize this information. Summary Generation: An extractive summary of the document is generated to be presented to the user in the search interface. Information Extraction: Domain specific IE templates are populated in this module for providing information to the user in a tabulated form. 3) Indexing 2

4 In the indexing phase, the various fields of a document are indexed and stored as an inverted index readily searchable. The important fields that are included in the index are the document title, content, language, domain and named entities. Document content is analyzed which includes stop word removal, stemming etc before indexing it. The most important modules in the offline processing phase are the Language Identification module and the Domain Identification module. These are discussed in detail in sections III and IV. B. Online Processing Online processing refers to all the tasks that are done from the time the user has fired a search query to the time the user sees the results page. A brief note on all the online processing tasks of CLIA is provided below. Named Entity and Multi- Word Expression extraction: A light weight NER and MWE analyzer is used to extract the named entities and multi word expressions so that they will be directly searched in the index without any further analysis like stemming. Query Processing: For a monolingual IR system this includes query reformulation to improve recall. For a cross lingual IR system this includes conversion of query from one language into another. Retrieval and Ranking: Retrieve relevant documents from the index and rank them as per their importance. Snippet generation: Generating a small excerpt from the document highlighting the query context so that the user can take a decision whether to visit the webpage or not. Snippet translation: This is an important feature of the CLIA system wherein the snippet is translated into the user s language for easier understanding. Query focused summary generation: Generating summary of the chosen document focusing on the query context. The entire functionality and performance of the CLIA system is dependant on the Query Processing module. This is discussed in detail in Section V. III. LANGUAGE IDENTIFICATION Language identification is a task of identifying the language of a page to which it belongs. There can be multiple languages present in a page, but categorizing it to the most dominating language present in the page is important. In the CLIA systems to index the pages of different languages initially during the crawling phase, language identification plays a major role. When a page is downloading during crawling, we initially identify the encoding of the font. If the font is proprietary then font is converted to UTF- 8 before doing the language identification. The language identifier categorizes the following Indian languages mainly Hindi, Tamil, Bengali, Telugu, Marathi and Punjabi in a document. For these it takes external data in the form of n- grams and dictionary of words in each language as input to train the n- gram 3

5 profiling model. When the UTF- 8 encoded page content is given to the language identifier it returns the Language ID for each of these Indian languages Hindi, Bengali, Tamil, Marathi, Punjabi and Telugu. IV. INPUT PROCESSING Input Processing involves post offline processing tasks that handle user inputs and feedback. Listed below are some of the non- trivial tasks in building efficient CLIA system. Query processing, which includes translation and transliteration, pre- translation query expansion and post translation query refinement and expansion, named entity and multiword expression recognition in query, word sense disambiguation are some of the tasks involved. Below we will discuss about query translation, transliteration and disambiguation aspects. A. Query Translation Query words which are other than named entities (NE) and multi-word expressions (MWE) needs to be translated from Indian language to the cross language i.e. Hindi and English to do further processing on the query. In order to translate them to cross language there is need of bilingual dictionaries in each of the Indian languages to English and Hindi. There are bilingual dictionaries stored for each of the Indian languages mentioned earlier consisting at least 20k parallel words for translation. When the query is fired in an Indian language the words other than NE s and MWE s will be searched through the bilingual dictionaries for exact translation. Then the query is reformed in the cross language before the query is fired in the search index. B. Query Transliteration Transliteration, the pronunciation- based translation or keyword- based translation from a source language to a desired target language, is important to many cross language and natural languages processing tasks. Also success of any CLIA system depends on efficiency of its transliteration of queries. When not carefully handled, the mean average precision (MAP) can reach 50% [2].Generally these queries may contain MWE s, NE s or general language terms. The challenge for the system is to identify the type of query terms and decide accordingly whether to translate or transliterate. Also, mining appropriate transliterations from the top results of the first- pass retrieval should be considered as it achieves enhanced cross lingual performance of the system overall, in addition to enhancing individual performance of more queries [3]. There are some drawbacks that need to be handled if transliteration does not produce the exact spelling variants used in the document collection. An approximate string matching technique is usually required to alleviate this drawback. By doing so, retrieval performance should not be compromised. If inaccurate or confusing transliterations occur in multiple alternative transliterations [4] appropriate action should be taken. Also, if translated or transliterated query is human readable it ensures proper transliteration and ultimately provides good search results. For transliterating NE s in the first step we took word- aligned bilingual corpus while for the second step statistics over the alignments to transliterate the source language word and generate the desired number of target language words is done. For this Hidden Markov Model 4

6 (HMM) alignment and Conditional Random Fields (CRFs), a discriminative model is used. HMM alignment maximizes the probability of the observed word pairs using the expectation maximization algorithm and then the character level alignments (n- gram) are set to maximum posterior predictions of the model while CRF uses forward Viterbi and backward A* search whose combination produce the exact n- best results. For HMM alignment GIZA++ is used while CRF++ for training the model. The list of named entity words identified that need to be transliterated is taken and using the trained model top 3 probable words is extracted. C. Query Disambiguation Basic research is needed to investigate the relationship between sense ambiguity, disambiguation, and information retrieval. Understanding the influence of ambiguity and disambiguation on a probabilistic IR system is required to improve the retrieval results. Query ambiguity is only problematic to an IR system when it is retrieving from very short queries [5]. Also word senses in a query has to resolve to high degree of accuracy. For CLIA, this creates much more complex problems as it involves understanding the query in multiple languages. To determine the sense of a word, a word sense disambiguation (WSD) algorithm typically should understand the context of the ambiguous word, external resources such as machine- readable dictionaries, or a combination of both. Although dictionaries provide useful word sense information and thesaurus provide additional information about relationships between words, they lack pragmatic information as unavailability of them in corpora. But there are major barriers in building a high- performing query disambiguation system as it includes difficulty of labeling data and predicting fine- grained sense distinctions. User queries can be diversified into different topics and areas and could mean differently in each of those areas. This creates problem of word disambiguation. Thus any system that attempts to retrieve good results has to determine the sense of a word from contextual features. V. END TO END TELUGU CLIA SYSTEM Our system focuses mainly on three languages: Telugu, Hindi and English. We analyze the performance of each language separately. In this section, we present an end to end separate CLIA system which is divided into different levels and modules. These levels consist of (a) Crawling, (b) Indexing and finally (C) Query processing, searching and ranking. We segregate each stage and explain the sub steps that were addressed at each level. A. Crawling A Crawler is a program, which browses the World Wide Web or the given document collection in a methodical automated manner and creates a copy of all the visited pages for later processing. In this CLIA system we have domain identifier [6] crawling system that looks for pages that fit a particular description. The pages belonging to certain selected topics, or satisfying certain criteria, are indexed and maintained. The crawler is guided by a classifier to recognize the relevance of the pages crawled. In this system we have specified domains, which crawls only Tourism and Health domains and also classifies the Indian languages. For this machine learning based SVM classifier is used. 5

7 B. Language Decoding As we are building system for Indian languages that have multiple encodings in each language needs to be addressed. Good number of language encodings is in usage on Internet and other document repositories. Especially in Indian language document repositories, these document collections are available in a set of national and international standards for Indian language character set encodings, while a number of publishers use proprietary non- standard encodings. For example Telugu language has some proprietary fonts like Hemalatah, Tikkana, Vaartha, Karthika etc.in order to be able to index such content, it is very important to identify the language and such encodings to achieve a better recall in retrieval by having much broader coverage. We use a set of heuristic based techniques using fonts and character ranges to recognize character encodings such as UTF- 8, ISCII or other proprietary encodings from HTML and PDF documents [7]. C. Indexing An indexer module should be capable of building data structures, which enable fast retrieval of relevant documents and also enable ranking of these documents. The design of data structure for index should also be sensitive to transaction aspects such as locking of indices for writes, updatability of index without difficulty etc. In a Cross- lingual IR task using a bilingual dictionary along with term co- occurrence statistics and language modeling approach [8] helps improve the functional IR performance. We used an augmented index model which is used for fast retrieval while having the benefits of language modeling in a CLIR task. This model is capable of retrieval and ranking with or without query expansion techniques using term collocation statistics of the indexed corpus [9]. D. Input Processing Query processing is important step for input processing in improving recall and precision of any CLIA system. Query pre- processing needs to be done to improve efficiency in retrieval. In this CLIA system query processing module uses n- gram based translation of query words using bi- lingual lexicons. Identification of named entities is done by Named Entity Recognizers. If translation is not possible transliteration of the identified named entities is been carried out. Also a Boolean query- scoring sub- module used to weigh Boolean queries to generate the output of the query- processing module. We briefly discuss some of the query processing steps performed in building this CLIA system for Telugu, Hindi and English. E. Query Processing List of stop words are prepared which is used to remove stop words from the query before translation or transliteration. Since Telugu is a highly agglutinative language and in some cases a whole phrase or a sentence from English could be translated into a single word in Telugu. For this we used some shallow rules [10] that can remove some common prefixes and suffixes. Some of the very common prefixes and suffixes are removed from a given Telugu 6

8 word. A given word is stemmed from these prefixes and suffixes recursively until a stem is obtained. Some of the other sub tasks in query pruning are listed below. Named Entities identification Named Entity Recognition is the task of identifying and classifying all proper nouns in a document as person names, organization names, location names, date & time expressions and miscellaneous. Conditional Random Fields (CRFs) [11] with two Indian languages Telugu and Hindi [12, 13] is used with the character n- gram based models for this task. Query translation using lexicons Resources containing English- Hindi and English- Telugu cross language dictionaries were primarily used in English to Indian language query translation. Available English- Hindi dictionary was conveniently formatted for machine processing; however the English- Telugu dictionary was a digitized version of a human readable dictionary. In order to convert the human readable dictionary to machine processing form, a set of regular expressions were used [7]. Transliteration Transliteration task is considered to be non- trivial as its output decides the quality of the results retrieved in CLIA systems. The basic principle followed was to compress the source word into its minimal form and align it across an indexed list of target language words to arrive at the top n- equivalents based on the modified edit distance. This Mapping model was used instead of alignment- based model as it ensures fastness. To speed up the process of arriving at the right match in the target index, Compressed Word Format (CWFs) of the target named entities and indexing the CWFs along with their actual forms is done. When any new query word arrives for transliteration CWF form is to be generated and is to be compared with the CWF of the source query word with the CWFs of the target language. Entries in the index of the matching entries with modified Levenshtein distance equal to zero will be returned. Query Scoring Query scoring helps to rank the retrieved results. Once the source language queries were translated and transliterated, the resultant target language keywords is to be used for constructing Boolean queries using the OR operator. If a particular keyword occurs in multiple sections of the query, it has to be given greater score compared to the other keywords accordingly. Hence, cumulative weight for each word to be calculated based on the number of occurrences [14]. Other important keywords like years, numbers, are also given higher weight factors. The translated query word in the Boolean query and the scoring weight are associated and the final query output for a given source language query from the system would be of the formed and used accordingly. F. Snippet Generation Snippet is a short summary of each document retrieved by the CLIA system for a user query. To retrieve the content of the retrieved document from search engine query words are given to the CLIA system. 7

9 In our system the query words will be highlighted in the snippet produced and then forwarded to the snippet Translation module. Extracted snippet is considered efficient if the sentence extraction has language specific sentence boundary delimiter task. For this title extracted using the title tags obtained from the HTML source file or from the document files is checked for accuracy. Each complete sentence will be checked for the presence of the query words in question. Sentences are ranked based on the number of query words, key words and domain ontology words present in the sentence. The generated short summary or snippet contains the context (five to ten words depending on the number of query words) of the highlighted query words subject to a predetermined length of the snippet. If no complete sentences are identified, the menu list on the web page will be listed as the snippet of the web page. G. Retrieval and Ranking Probabilistic retrieval algorithm BM25 [15] is used for cross lingual and monolingual information retrieval. It is a ranking function used to rank matching documents according to their relevance to a given search query. It uses bagof-words approach for retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document. It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. VI. EXPERIMENTAL RESULTS The evaluation of language identifier, Input processing and end- to- end CLIA system mentioned in earlier sections consists of analyzing and evaluating quantitatively to find the performance and issues in various components of the system. The evaluation framework will also dependent on the construction of benchmark data for the Telugu and Hindi languages in accordance with the requirements. Evaluation here depends on two major evaluations one is on methods and metrics and other is on creating benchmark data. Two of the most widely used measures of document retrieval effectiveness are Recall and Precision. Recall measures how well a system retrieves all the relevant documents; and Precision, how well the system retrieves only the relevant documents [16]. Precision scores have been evaluated for Language Identifier for each of the languages Hindi, Bengali, Marathi, Telugu, Punjabi and Tamil. Table 1 describes the module accuracy in identifying the languages. TABLE I. LANGUAGE IDENTIFIER ACCURACY Language Accuracy Telugu 98% Hindi 99% Tamil 99% Punjabi 97% Marati 98% Bengali 99% Input processing have sub tasks involved which needs to be evaluated separately. We will see the results obtained in the input processing task in end to end Telugu CLIA system. 8

10 End to End Telugu CLIA system has many sub tasks involved we will see evaluations done on each of them separately. Crawling Crawling was done to index only tourism domain related websites. Table 2 shows the accuracy achieved by the crawler in identifying the tourism related websites for Hindi and Telugu. TABLE II. DOMAIN IDENTIFIER Language Accuracy Telugu 83.5% Hindi 66% NER identification in corpus The corpus in total consisted of tokens out of which were NE s. Corpus is divided into training and testing sets. The training set consisted of tokens out of which 8485 were NE s. The testing set consisted of tokens out of which 2407 were NE s. A word window 6 words at a time having parts of speech (POS), suffix and prefix gave F- Measure 44.91% for Hindi. Another approach using n- grams has been performed which attained F- Measure of 44.57% for Hindi if n=4 and 49.62% for Telugu if n=3. Translation of Query Telugu to Hindi and Telugu to English Bilingual dictionaries has been developed which has 46k entries. Transliteration of Query Transliteration methods mentioned in the earlier sections using HMM and CRF models produced overall 87% precision when generated models are of Telugu to Hindi alignments and 83% for Telugu to English. Romanization method also tested which produced lower accuracies compared to HMM and CRF model. Overall system was evaluated using the P@5 and P@10 scores for 5 queries.table 3 below shows the scores for each of the queries. TABLE III. P@5 AND P@10 FOR 5 QUERIES FROM TELUGU-HINDI Query No. P@5 P@

11 TABLE IV. P@5 AND P@10 FOR 5 QUERIES FROM TELUGU-ENGLISH Query No. P@5 P@ VII. CONCLUSION AND FUTURE WORK We have described the CLIA systems pre processing and post processing tasks along with language identification and other specific tasks for building CLIA Telugu system. We understood that if the system handles information between pair of languages then they are called cross language systems. All cross language information access systems essentially benefit from the Information Retrieval technologies, which enable information access by retrieving a set of ranked documents that are likely to be relevant to the information. However, the role of IR technologies ends once the relevant documents are obtained. But there are other problems when dealing with CLIA systems. They require query translation/translation, NE identification, results ranking, which have dependencies on lexicons having good coverage, bilingual dictionaries, and parallel corpora for efficient information retrieval. In this paper, we have presented major challenges in building those CLIA systems and also we described end to end Telugu CLIA system. Also we will continuously working on the improvement on the existing system using various state of art approaches in input and output processing stages which constitutes our future work. ACKNOWLEDGMENT We acknowledge the funding we have received from Ministry of Communication and Information Technology (MCIT), Government of India and Cross Language Information Access (CLIA) consortium members helped us. REFERENCES [1] Oard, D., & Dorr, B. (1996). A Survey of Multilingual Text Retrieval. Technical Report UMIACS-TR-96-19, University of Maryland, Institute for Advanced Computer Studies. [2] Larkey, L., AbdulJaleel, N., & Connell, M. (2003).What s in a Name? : Proper Names in Arabic Cross Language Information Retrieval. CIIR Technical Report, IR-278, Univ. of Amherst. [3] Saravanan, K., Udupa, R., & Kumaran, A. (2010). Cross lingual Information Retrieval System Enhanced with Transliteration Generation and Mining. In Proceedings of Forum for Information Retrieval Evaluation (FIRE-2010) Workshop, Kolkata, India. 10

12 [4] Ea-Ee Jan, Shih-Hsiang Lin, & Berlin Chen. (2010). Transliteration Retrieval Model for Cross Lingual Information Retrieval. Lecture Notes in Computer Science, 2010, Volume 6458, Information Retrieval Technology, pp [5] Sanderson, M. (1994).Word sense disambiguation and information retrieval. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, Dublin, Ireland, p [6] Chakrabarti, S., Dom, B., & van den Berg, M. (1999). Focused Crawling: A New Approach for Topic-Specific Resource Discovery. In Proc. 8th World Wide Web Conf., Elsevier Science, Amsterdam, pp [7] Pingali, P., & Varma, V. (2006).Hindi and Telugu to English Cross Language Information Retrieval.Working Notes of Cross Language Evaluation Forum Workshop, Spain. [8] Pingali P., Kula K., T., & Varma, V. (2007). Hindi, Telugu, Oromo, English CLIR Evaluation. Evaluation of Multilingual and Multi-modal Information Retrieval. Vol. 4730, 2007, ISBN , Springer-Verlag. [9] Pingali, P., & Varma, V. (2007). Multilingual Indexing Support for CLIR using Language Modeling. In Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. [10] Pingali, P., Jagarlamudi, J., & Varma, V. (2006). A Dictionary Based Approach with Query Expansion to Cross Language Query Based Multi-Document Summarization: Experiments in Telugu English. National Workshop on Artificial Intelligence, Mumbai, India. [11] Wallach H., M. (2004). Conditional random fields: An introduction. Technical Report MS-CIS-04-21, University of Pennsylvania, Department of Computer and Information Science, University of Pennsylvania. [12] Shishtla, P., M., Pingali, P., & Varma, V. (2008). A Character n-gram Based Approach for Improved Recall in Indian Language NER. In Proceedings of IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages, Hyderabad, India, pp [13] Shishtla, P., M., Pingali, P., & Varma, V. (2008). Experiments in Telugu NER: A Conditional Random Field Approach. In Proceedings of IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages, Hyderabad, India, pp [14] Ballesteros, L., & Croft W., B. (1997). Phrasal translation and query expansion techniques for cross-language information retrieval. In Proceedings of the 20th Annual International ACM Conference on Research and Development in Information Retrieval, Philadelphia, PA, July 27 31, pp [15] Robertson, S., & Zaragoza, H. (2009).Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval. pp [16] Saracevic, T. (1975). Relevance: A review of and a framework for thinking on the notion in information science. Journal of the American Society for Information Science and Technology, pp

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Multilingual Information Retrieval

Multilingual Information Retrieval Proposal for Tutorial on Multilingual Information Retrieval Proposed by Arjun Atreya V Shehzaad Dhuliawala ShivaKarthik S Swapnil Chaudhari Under the direction of Prof. Pushpak Bhattacharyya Department

More information

Research Article. August 2017

Research Article. August 2017 International Journals of Advanced Research in Computer Science and Software Engineering ISSN: 2277-128X (Volume-7, Issue-8) a Research Article August 2017 English-Marathi Cross Language Information Retrieval

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

DCU at FIRE 2013: Cross-Language!ndian News Story Search

DCU at FIRE 2013: Cross-Language!ndian News Story Search DCU at FIRE 2013: Cross-Language!ndian News Story Search Piyush Arora, Jennifer Foster, and Gareth J. F. Jones CNGL Centre for Global Intelligent Content School of Computing, Dublin City University Glasnevin,

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

AN UNSUPERVISED APPROACH TO DEVELOP IR SYSTEM: THE CASE OF URDU

AN UNSUPERVISED APPROACH TO DEVELOP IR SYSTEM: THE CASE OF URDU AN UNSUPERVISED APPROACH TO DEVELOP IR SYSTEM: THE CASE OF URDU ABSTRACT Mohd. Shahid Husain Integral University, Lucknow Web Search Engines are best gifts to the mankind by Information and Communication

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

This paper studies methods to enhance cross-language retrieval of domain-specific

This paper studies methods to enhance cross-language retrieval of domain-specific Keith A. Gatlin. Enhancing Cross-Language Retrieval of Comparable Corpora Through Thesaurus-Based Translation and Citation Indexing. A master s paper for the M.S. in I.S. degree. April, 2005. 23 pages.

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

3 Publishing Technique

3 Publishing Technique Publishing Tool 32 3 Publishing Technique As discussed in Chapter 2, annotations can be extracted from audio, text, and visual features. The extraction of text features from the audio layer is the approach

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND 41 CHAPTER 5 TEXT W. Bruce Croft BACKGROUND Much of the information in digital library or digital information organization applications is in the form of text. Even when the application focuses on multimedia

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Automatic Bangla Corpus Creation

Automatic Bangla Corpus Creation Automatic Bangla Corpus Creation Asif Iqbal Sarkar, Dewan Shahriar Hossain Pavel and Mumit Khan BRAC University, Dhaka, Bangladesh asif@bracuniversity.net, pavel@bracuniversity.net, mumit@bracuniversity.net

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

1.

1. * 390/0/2 : 389/07/20 : 2 25-8223 ( ) 2 25-823 ( ) ISC SCOPUS L ISA http://jist.irandoc.ac.ir 390 22-97 - :. aminnezarat@gmail.com mosavit@pnu.ac.ir : ( ).... 00.. : 390... " ". ( )...2 2. 3. 4 Google..

More information

Cross Lingual Query Dependent Snippet Generation

Cross Lingual Query Dependent Snippet Generation Cross Lingual Query Dependent Snippet Generation Pinaki Bhaskar, Sivaji Bandyopadhyay Computer Science and Engineering Department, Jadavpur University Kolkata 700032, India Abstract The present paper describes

More information

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion Sara Lana-Serrano 1,3, Julio Villena-Román 2,3, José C. González-Cristóbal 1,3 1 Universidad Politécnica de Madrid 2 Universidad

More information

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & 6367(Print), ISSN 0976 6375(Online) Volume 3, Issue 1, January- June (2012), TECHNOLOGY (IJCET) IAEME ISSN 0976 6367(Print) ISSN 0976 6375(Online) Volume

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Cross-Language Information Retrieval using Dutch Query Translation

Cross-Language Information Retrieval using Dutch Query Translation Cross-Language Information Retrieval using Dutch Query Translation Anne R. Diekema and Wen-Yuan Hsiao Syracuse University School of Information Studies 4-206 Ctr. for Science and Technology Syracuse, NY

More information

A Model for Information Retrieval Agent System Based on Keywords Distribution

A Model for Information Retrieval Agent System Based on Keywords Distribution A Model for Information Retrieval Agent System Based on Keywords Distribution Jae-Woo LEE Dept of Computer Science, Kyungbok College, 3, Sinpyeong-ri, Pocheon-si, 487-77, Gyeonggi-do, Korea It2c@koreaackr

More information

Cross lingual Information Retrieval

Cross lingual Information Retrieval Cross lingual Information Retrieval Chapter 1. CLIR and its challenges A large amount of information in the form of text, audio, video and other documents is available on the web. Users should be able

More information

What is this Song About?: Identification of Keywords in Bollywood Lyrics

What is this Song About?: Identification of Keywords in Bollywood Lyrics What is this Song About?: Identification of Keywords in Bollywood Lyrics by Drushti Apoorva G, Kritik Mathur, Priyansh Agrawal, Radhika Mamidi in 19th International Conference on Computational Linguistics

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Ontology-Based Web Query Classification for Research Paper Searching

Ontology-Based Web Query Classification for Research Paper Searching Ontology-Based Web Query Classification for Research Paper Searching MyoMyo ThanNaing University of Technology(Yatanarpon Cyber City) Mandalay,Myanmar Abstract- In web search engines, the retrieval of

More information

Cross Lingual Information Retrieval Using Data Mining Methods

Cross Lingual Information Retrieval Using Data Mining Methods Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2009 Proceedings Americas Conference on Information Systems (AMCIS) 2009 Cross Lingual Information Retrieval Using Data Mining Methods

More information

A Practical Passage-based Approach for Chinese Document Retrieval

A Practical Passage-based Approach for Chinese Document Retrieval A Practical Passage-based Approach for Chinese Document Retrieval Szu-Yuan Chi 1, Chung-Li Hsiao 1, Lee-Feng Chien 1,2 1. Department of Information Management, National Taiwan University 2. Institute of

More information

Web Information Retrieval using WordNet

Web Information Retrieval using WordNet Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT

More information

Search Engines Information Retrieval in Practice

Search Engines Information Retrieval in Practice Search Engines Information Retrieval in Practice W. BRUCE CROFT University of Massachusetts, Amherst DONALD METZLER Yahoo! Research TREVOR STROHMAN Google Inc. ----- PEARSON Boston Columbus Indianapolis

More information

Handwritten Script Recognition at Block Level

Handwritten Script Recognition at Block Level Chapter 4 Handwritten Script Recognition at Block Level -------------------------------------------------------------------------------------------------------------------------- Optical character recognition

More information

Data and Information Integration: Information Extraction

Data and Information Integration: Information Extraction International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Data and Information Integration: Information Extraction Varnica Verma 1 1 (Department of Computer Science Engineering, Guru Nanak

More information

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

Focused crawling: a new approach to topic-specific Web resource discovery. Authors Focused crawling: a new approach to topic-specific Web resource discovery Authors Soumen Chakrabarti Martin van den Berg Byron Dom Presented By: Mohamed Ali Soliman m2ali@cs.uwaterloo.ca Outline Why Focused

More information

Learning to find transliteration on the Web

Learning to find transliteration on the Web Learning to find transliteration on the Web Chien-Cheng Wu Department of Computer Science National Tsing Hua University 101 Kuang Fu Road, Hsin chu, Taiwan d9283228@cs.nthu.edu.tw Jason S. Chang Department

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,

More information

Don't Use a Lot When Little Will Do : Genre Identi cation Using URLs

Don't Use a Lot When Little Will Do : Genre Identi cation Using URLs Don't Use a Lot When Little Will Do : Genre Identi cation Using URLs by Nikhil Priyatam, Srinivasan Iyengar, Krish Perumal, Vasudeva Varma in 14th International Conference on Intelligent Text Processing

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

An Approach To Web Content Mining

An Approach To Web Content Mining An Approach To Web Content Mining Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol Department of Computer Engg. Datta Meghe College of Engineering, Airoli, Navi Mumbai Abstract-With the research

More information

Text Mining: A Burgeoning technology for knowledge extraction

Text Mining: A Burgeoning technology for knowledge extraction Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.

More information

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm ISBN 978-93-84468-0-0 Proceedings of 015 International Conference on Future Computational Technologies (ICFCT'015 Singapore, March 9-30, 015, pp. 197-03 Sense-based Information Retrieval System by using

More information

Automatic Text Summarization System Using Extraction Based Technique

Automatic Text Summarization System Using Extraction Based Technique Automatic Text Summarization System Using Extraction Based Technique 1 Priyanka Gonnade, 2 Disha Gupta 1,2 Assistant Professor 1 Department of Computer Science and Engineering, 2 Department of Computer

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

Context Based Indexing in Search Engines: A Review

Context Based Indexing in Search Engines: A Review International Journal of Computer (IJC) ISSN 2307-4523 (Print & Online) Global Society of Scientific Research and Researchers http://ijcjournal.org/ Context Based Indexing in Search Engines: A Review Suraksha

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

More information

Context Based Web Indexing For Semantic Web

Context Based Web Indexing For Semantic Web IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 12, Issue 4 (Jul. - Aug. 2013), PP 89-93 Anchal Jain 1 Nidhi Tyagi 2 Lecturer(JPIEAS) Asst. Professor(SHOBHIT

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd.

Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd. Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd. This paper was presented at the ICADL conference December 2001 by Spectrum

More information

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2562-2567 2562 Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Mrunal S. Bewoor Department

More information

Handling Place References in Text

Handling Place References in Text Handling Place References in Text Introduction Most (geographic) information is available in the form of textual documents Place reference resolution involves two-subtasks: Recognition : Delimiting occurrences

More information

Finding Topic-centric Identified Experts based on Full Text Analysis

Finding Topic-centric Identified Experts based on Full Text Analysis Finding Topic-centric Identified Experts based on Full Text Analysis Hanmin Jung, Mikyoung Lee, In-Su Kang, Seung-Woo Lee, Won-Kyung Sung Information Service Research Lab., KISTI, Korea jhm@kisti.re.kr

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

NUS-I2R: Learning a Combined System for Entity Linking

NUS-I2R: Learning a Combined System for Entity Linking NUS-I2R: Learning a Combined System for Entity Linking Wei Zhang Yan Chuan Sim Jian Su Chew Lim Tan School of Computing National University of Singapore {z-wei, tancl} @comp.nus.edu.sg Institute for Infocomm

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP IJIRST International Journal for Innovative Research in Science & Technology Volume 2 Issue 02 July 2015 ISSN (online): 2349-6010 A Hybrid Unsupervised Web Data Extraction using Trinity and NLP Anju R

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

A Review on Identifying the Main Content From Web Pages

A Review on Identifying the Main Content From Web Pages A Review on Identifying the Main Content From Web Pages Madhura R. Kaddu 1, Dr. R. B. Kulkarni 2 1, 2 Department of Computer Scienece and Engineering, Walchand Institute of Technology, Solapur University,

More information

Information Retrieval Spring Web retrieval

Information Retrieval Spring Web retrieval Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic

More information

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,

More information

A Document Graph Based Query Focused Multi- Document Summarizer

A Document Graph Based Query Focused Multi- Document Summarizer A Document Graph Based Query Focused Multi- Document Summarizer By Sibabrata Paladhi and Dr. Sivaji Bandyopadhyay Department of Computer Science and Engineering Jadavpur University Jadavpur, Kolkata India

More information

A User Study on Features Supporting Subjective Relevance for Information Retrieval Interfaces

A User Study on Features Supporting Subjective Relevance for Information Retrieval Interfaces A user study on features supporting subjective relevance for information retrieval interfaces Lee, S.S., Theng, Y.L, Goh, H.L.D., & Foo, S. (2006). Proc. 9th International Conference of Asian Digital Libraries

More information

Noida institute of engineering and technology,greater noida

Noida institute of engineering and technology,greater noida Impact Of Word Sense Ambiguity For English Language In Web IR Prachi Gupta 1, Dr.AnuragAwasthi 2, RiteshRastogi 3 1,2,3 Department of computer Science and engineering, Noida institute of engineering and

More information

Cross-Lingual Information Access and Its Evaluation

Cross-Lingual Information Access and Its Evaluation Cross-Lingual Information Access and Its Evaluation Noriko Kando Research and Development Department National Center for Science Information Systems (NACSIS), Japan URL: http://www.rd.nacsis.ac.jp/~{ntcadm,kando}/

More information

AN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY

AN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY Asian Journal Of Computer Science And Information Technology 2: 3 (2012) 26 30. Contents lists available at www.innovativejournal.in Asian Journal of Computer Science and Information Technology Journal

More information

Information Retrieval. Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

NOVEL IMPLEMENTATION OF SEARCH ENGINE FOR TELUGU DOCUMENTS WITH SYLLABLE N- GRAM MODEL

NOVEL IMPLEMENTATION OF SEARCH ENGINE FOR TELUGU DOCUMENTS WITH SYLLABLE N- GRAM MODEL NOVEL IMPLEMENTATION OF SEARCH ENGINE FOR TELUGU DOCUMENTS WITH SYLLABLE N- GRAM MODEL DR.B.PADMAJA RANI* AND DR.A.VINAY BABU 1 *Associate Professor Department of CSE JNTUCEH Hyderabad A.P. India http://jntuceh.ac.in/csstaff.htm

More information

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper

More information

Improvement of Web Search Results using Genetic Algorithm on Word Sense Disambiguation

Improvement of Web Search Results using Genetic Algorithm on Word Sense Disambiguation Volume 3, No.5, May 24 International Journal of Advances in Computer Science and Technology Pooja Bassin et al., International Journal of Advances in Computer Science and Technology, 3(5), May 24, 33-336

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

RLAT Rapid Language Adaptation Toolkit

RLAT Rapid Language Adaptation Toolkit RLAT Rapid Language Adaptation Toolkit Tim Schlippe May 15, 2012 RLAT Rapid Language Adaptation Toolkit - 2 RLAT Rapid Language Adaptation Toolkit RLAT Rapid Language Adaptation Toolkit - 3 Outline Introduction

More information

IJRIM Volume 2, Issue 2 (February 2012) (ISSN )

IJRIM Volume 2, Issue 2 (February 2012) (ISSN ) AN ENHANCED APPROACH TO OPTIMIZE WEB SEARCH BASED ON PROVENANCE USING FUZZY EQUIVALENCE RELATION BY LEMMATIZATION Divya* Tanvi Gupta* ABSTRACT In this paper, the focus is on one of the pre-processing technique

More information

Word Disambiguation in Web Search

Word Disambiguation in Web Search Word Disambiguation in Web Search Rekha Jain Computer Science, Banasthali University, Rajasthan, India Email: rekha_leo2003@rediffmail.com G.N. Purohit Computer Science, Banasthali University, Rajasthan,

More information

Discriminative Training with Perceptron Algorithm for POS Tagging Task

Discriminative Training with Perceptron Algorithm for POS Tagging Task Discriminative Training with Perceptron Algorithm for POS Tagging Task Mahsa Yarmohammadi Center for Spoken Language Understanding Oregon Health & Science University Portland, Oregon yarmoham@ohsu.edu

More information

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana School of Information Technology A Frequent Max Substring Technique for Thai Text Indexing Todsanai Chumwatana This thesis is presented for the Degree of Doctor of Philosophy of Murdoch University May

More information

Recognition of online captured, handwritten Tamil words on Android

Recognition of online captured, handwritten Tamil words on Android Recognition of online captured, handwritten Tamil words on Android A G Ramakrishnan and Bhargava Urala K Medical Intelligence and Language Engineering (MILE) Laboratory, Dept. of Electrical Engineering,

More information

Using Monolingual Clickthrough Data to Build Cross-lingual Search Systems

Using Monolingual Clickthrough Data to Build Cross-lingual Search Systems Using Monolingual Clickthrough Data to Build Cross-lingual Search Systems ABSTRACT Vamshi Ambati Institute for Software Research International Carnegie Mellon University Pittsburgh, PA vamshi@cmu.edu A

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Domain Based Named Entity Recognition using Naive Bayes

Domain Based Named Entity Recognition using Naive Bayes AUSTRALIAN JOURNAL OF BASIC AND APPLIED SCIENCES ISSN:1991-8178 EISSN: 2309-8414 Journal home page: www.ajbasweb.com Domain Based Named Entity Recognition using Naive Bayes Classification G.S. Mahalakshmi,

More information

Improving Synoptic Querying for Source Retrieval

Improving Synoptic Querying for Source Retrieval Improving Synoptic Querying for Source Retrieval Notebook for PAN at CLEF 2015 Šimon Suchomel and Michal Brandejs Faculty of Informatics, Masaryk University {suchomel,brandejs}@fi.muni.cz Abstract Source

More information

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Automatic New Topic Identification in Search Engine Transaction Log

More information

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Evaluating the Usefulness of Sentiment Information for Focused Crawlers Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,

More information

CADIAL Search Engine at INEX

CADIAL Search Engine at INEX CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr

More information

Overview of FIRE 2011 Prasenjit Majumder on behalf of the FIRE team

Overview of FIRE 2011 Prasenjit Majumder on behalf of the FIRE team Overview of FIRE 2011 Prasenjit Majumder on behalf of the FIRE team Overview of FIRE 2011 p. 1/21 Overview Background Tasks Data Results Problems and prospects People Overview of FIRE 2011 p. 2/21 Background

More information

Competitive Intelligence and Web Mining:

Competitive Intelligence and Web Mining: Competitive Intelligence and Web Mining: Domain Specific Web Spiders American University in Cairo (AUC) CSCE 590: Seminar1 Report Dr. Ahmed Rafea 2 P age Khalid Magdy Salama 3 P age Table of Contents Introduction

More information

MURDOCH RESEARCH REPOSITORY

MURDOCH RESEARCH REPOSITORY MURDOCH RESEARCH REPOSITORY http://researchrepository.murdoch.edu.au/ This is the author s final version of the work, as accepted for publication following peer review but without the publisher s layout

More information

Searching and Organizing Images Across Languages

Searching and Organizing Images Across Languages Searching and Organizing Images Across Languages Paul Clough University of Sheffield Western Bank Sheffield, UK +44 114 222 2664 p.d.clough@sheffield.ac.uk Mark Sanderson University of Sheffield Western

More information

Extracting Summary from Documents Using K-Mean Clustering Algorithm

Extracting Summary from Documents Using K-Mean Clustering Algorithm Extracting Summary from Documents Using K-Mean Clustering Algorithm Manjula.K.S 1, Sarvar Begum 2, D. Venkata Swetha Ramana 3 Student, CSE, RYMEC, Bellary, India 1 Student, CSE, RYMEC, Bellary, India 2

More information

Web Data Extraction and Alignment Tools: A Survey Pranali Nikam 1 Yogita Gote 2 Vidhya Ghogare 3 Jyothi Rapalli 4

Web Data Extraction and Alignment Tools: A Survey Pranali Nikam 1 Yogita Gote 2 Vidhya Ghogare 3 Jyothi Rapalli 4 IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 01, 2015 ISSN (online): 2321-0613 Web Data Extraction and Alignment Tools: A Survey Pranali Nikam 1 Yogita Gote 2 Vidhya

More information

Tilburg University. Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval

Tilburg University. Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval Tilburg University Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval Publication date: 2006 Link to publication Citation for published

More information

CHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING

CHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING 94 CHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING 5.1 INTRODUCTION Expert locator addresses the task of identifying the right person with the appropriate skills and knowledge. In large organizations, it

More information

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna

More information

Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization

Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Romain Deveaud 1 and Florian Boudin 2 1 LIA - University of Avignon romain.deveaud@univ-avignon.fr

More information

Log Linear Model for String Transformation Using Large Data Sets

Log Linear Model for String Transformation Using Large Data Sets Log Linear Model for String Transformation Using Large Data Sets Mr.G.Lenin 1, Ms.B.Vanitha 2, Mrs.C.K.Vijayalakshmi 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology,

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured

More information