CROSS LANGUAGE INFORMATION ACCESS IN TELUGU
|
|
- Melvyn Newton
- 5 years ago
- Views:
Transcription
1 CROSS LANGUAGE INFORMATION ACCESS IN TELUGU by Vasudeva Varma, Aditya Mogadala Mogadala, V. Srikanth Reddy, Ram Bhupal Reddy in Siliconandhrconference (Global Internet forum for Telugu) Report No: IIIT/TR/2011/-1 Centre for Search and Information Extraction Lab International Institute of Information Technology Hyderabad , INDIA September 2011
2 CROSS LANGUAGE INFORMATION ACCESS IN TELUGU Vasudeva Varma*, Aditya Mogadala, Srikanth Reddy, Bhupal Reddy Search and Information Extraction Lab IIIT- Hyderabad, India Abstract This paper describes a large scale system for Cross Language Information Access (CLIA) which will help accessing information available in a language that is different from the language in which a query or information need is expressed. We specifically focus on Telugu to English and Hindi CLIA system. For a query given in Telugu, this system retrieves information available in English and Hindi. It also tries to present that information back in Telugu using several information access technologies such as: Information Extraction, Summarization and Machine Translation. This system was developed at IIIT Hyderabad, in the context of a larger project funded by Ministry of Communication and Information Technology executed by a consortium of ten Universities and research organizations covering six Indian languages for input and English and Hindi as target languages. CLIA systems can be considered as extension to Cross Language Information Retrieval (CLIR) systems. They process of the results obtained by the CLIR system in the target language. Users unfamiliar with the language of documents returned using CLIR are often unable to extract relevant information from these documents. This requires further processing which might include producing a summary of the multiple documents retrieved, translating such summary back to the source or the query language, extracting structured information from the retrieved documents and then producing human consumable information nuggets, and translating the entire or the relevant portions of the document. In order to address above- mentioned needs for building efficient CLIA systems one requires fully automatic high quality translation system at different levels of information processing. This paper details the pre- retrieval and post- retrieval processing tasks where various challenges like query translation, language identification, software engineering issues involved in building end to end systems. I. INTRODUCTION Information Access is the process of making the information available in various documents accessible and usable to the user who have a specific information need. The documents may be of various media, formats, document sources or even languages. Information Retrieval (IR) technologies enable information access by retrieving a set of ranked documents that are likely to be relevant to the information need of the user. However, IR is only a part of the information access puzzle. Role of IR technologies ends once the relevant documents are obtained. After the results are obtained, the user needs to skim through the documents, judge the relevance of these documents, compare them against each other, find out relevant portions of the document that might satisfy their information need, extract elements of the text that provide answers, and perhaps summarize multiple documents or portions of the documents. All this requires processing of world knowledge. Information Access technologies are expected to provide these functionalities that is more cognitive in nature. CLIR can be seen also as a technology that combines both Information Retrieval and Machine Translation (MT) and extend to the languages other than English. The structure of CLIR system is broken down into categories like Indexing (IR), translation, ranking and matching. Oard & Dorr [1] work is perhaps the first attempt to study various approaches and techniques of CLIR in a detailed manner and showed that CLIR is not exactly the combination of IR and MT and somewhere between them. IR focuses on retrieving the relevant documents given a query by a human user and MT aims at producing single accurate target equivalent of a given input text. CLIR s functionality may be a hybrid of both these systems in the sense that the IR engine of CLIR system can take multiple translations produced by a program (as opposed to human user s single query) and the translation system tries to process not so complete input 1
3 text (for example, just named entities, multiword expressions or simple sentence fragments) and produce multiple target language equivalents focusing less on producing grammatically correct translations. CLIA further improves this information and provide more meaningful representation of the information. In this paper we first describe the overall architecture of our CLIA system in Section 2. We see some preprocessing aspects like language identification during crawling in Section 3. Input processing aspects like query translation, transliteration and disambiguation issues will be mentioned in Section 4.Then we will see some of the challenges and building aspects of the end to end Telugu CLIA system in Section 5. We explain the experimental results obtained in each of the modules mentioned. Conclusion and Future Work will be mentioned in Section 7. II. ARCHITECTURE OF THE CLIA SYSTEM The functionality of the CLIA system can be broken down into two main phases A) Offline Processing and B) Online Processing. A. Offline Processing The offline processing phase encapsulates the following tasks crawling, parsing and indexing. 1) Crawling This task simply refers to the process of crawling the web and fetching all the documents found for further processing. The crawl starts by taking as input a set of starting URLs called as Seed URLs. Once these URLs are fetched, the out links from these documents are added to the queue and the crawler fetches them too. This process is repeated until the queue is empty or a present limit is reached. 2) Parsing The parsing phase is an important part of the offline phase. All processing of the fetched documents is done in this phase. The CLIA system does the following tasks during this phase not necessarily in the specified order. Font Transcoding: Conversion of web pages with non- standard or proprietary character or font encoding into standard UTF- 8. Language Identification: Identify the writing language of the document. Any language specific offline processing is done based on the language attribute of the document. Domain Identification: Categorize the webpage into one of the pre- decided domains. Current system supports only Tourism. This information is used for domain specific processing like deciding upon the IE template to be filled. Named Entity and Multi- word Expression Extraction: NEs and MWEs in the document are marked so as to facilitate other modules to utilize this information. Summary Generation: An extractive summary of the document is generated to be presented to the user in the search interface. Information Extraction: Domain specific IE templates are populated in this module for providing information to the user in a tabulated form. 3) Indexing 2
4 In the indexing phase, the various fields of a document are indexed and stored as an inverted index readily searchable. The important fields that are included in the index are the document title, content, language, domain and named entities. Document content is analyzed which includes stop word removal, stemming etc before indexing it. The most important modules in the offline processing phase are the Language Identification module and the Domain Identification module. These are discussed in detail in sections III and IV. B. Online Processing Online processing refers to all the tasks that are done from the time the user has fired a search query to the time the user sees the results page. A brief note on all the online processing tasks of CLIA is provided below. Named Entity and Multi- Word Expression extraction: A light weight NER and MWE analyzer is used to extract the named entities and multi word expressions so that they will be directly searched in the index without any further analysis like stemming. Query Processing: For a monolingual IR system this includes query reformulation to improve recall. For a cross lingual IR system this includes conversion of query from one language into another. Retrieval and Ranking: Retrieve relevant documents from the index and rank them as per their importance. Snippet generation: Generating a small excerpt from the document highlighting the query context so that the user can take a decision whether to visit the webpage or not. Snippet translation: This is an important feature of the CLIA system wherein the snippet is translated into the user s language for easier understanding. Query focused summary generation: Generating summary of the chosen document focusing on the query context. The entire functionality and performance of the CLIA system is dependant on the Query Processing module. This is discussed in detail in Section V. III. LANGUAGE IDENTIFICATION Language identification is a task of identifying the language of a page to which it belongs. There can be multiple languages present in a page, but categorizing it to the most dominating language present in the page is important. In the CLIA systems to index the pages of different languages initially during the crawling phase, language identification plays a major role. When a page is downloading during crawling, we initially identify the encoding of the font. If the font is proprietary then font is converted to UTF- 8 before doing the language identification. The language identifier categorizes the following Indian languages mainly Hindi, Tamil, Bengali, Telugu, Marathi and Punjabi in a document. For these it takes external data in the form of n- grams and dictionary of words in each language as input to train the n- gram 3
5 profiling model. When the UTF- 8 encoded page content is given to the language identifier it returns the Language ID for each of these Indian languages Hindi, Bengali, Tamil, Marathi, Punjabi and Telugu. IV. INPUT PROCESSING Input Processing involves post offline processing tasks that handle user inputs and feedback. Listed below are some of the non- trivial tasks in building efficient CLIA system. Query processing, which includes translation and transliteration, pre- translation query expansion and post translation query refinement and expansion, named entity and multiword expression recognition in query, word sense disambiguation are some of the tasks involved. Below we will discuss about query translation, transliteration and disambiguation aspects. A. Query Translation Query words which are other than named entities (NE) and multi-word expressions (MWE) needs to be translated from Indian language to the cross language i.e. Hindi and English to do further processing on the query. In order to translate them to cross language there is need of bilingual dictionaries in each of the Indian languages to English and Hindi. There are bilingual dictionaries stored for each of the Indian languages mentioned earlier consisting at least 20k parallel words for translation. When the query is fired in an Indian language the words other than NE s and MWE s will be searched through the bilingual dictionaries for exact translation. Then the query is reformed in the cross language before the query is fired in the search index. B. Query Transliteration Transliteration, the pronunciation- based translation or keyword- based translation from a source language to a desired target language, is important to many cross language and natural languages processing tasks. Also success of any CLIA system depends on efficiency of its transliteration of queries. When not carefully handled, the mean average precision (MAP) can reach 50% [2].Generally these queries may contain MWE s, NE s or general language terms. The challenge for the system is to identify the type of query terms and decide accordingly whether to translate or transliterate. Also, mining appropriate transliterations from the top results of the first- pass retrieval should be considered as it achieves enhanced cross lingual performance of the system overall, in addition to enhancing individual performance of more queries [3]. There are some drawbacks that need to be handled if transliteration does not produce the exact spelling variants used in the document collection. An approximate string matching technique is usually required to alleviate this drawback. By doing so, retrieval performance should not be compromised. If inaccurate or confusing transliterations occur in multiple alternative transliterations [4] appropriate action should be taken. Also, if translated or transliterated query is human readable it ensures proper transliteration and ultimately provides good search results. For transliterating NE s in the first step we took word- aligned bilingual corpus while for the second step statistics over the alignments to transliterate the source language word and generate the desired number of target language words is done. For this Hidden Markov Model 4
6 (HMM) alignment and Conditional Random Fields (CRFs), a discriminative model is used. HMM alignment maximizes the probability of the observed word pairs using the expectation maximization algorithm and then the character level alignments (n- gram) are set to maximum posterior predictions of the model while CRF uses forward Viterbi and backward A* search whose combination produce the exact n- best results. For HMM alignment GIZA++ is used while CRF++ for training the model. The list of named entity words identified that need to be transliterated is taken and using the trained model top 3 probable words is extracted. C. Query Disambiguation Basic research is needed to investigate the relationship between sense ambiguity, disambiguation, and information retrieval. Understanding the influence of ambiguity and disambiguation on a probabilistic IR system is required to improve the retrieval results. Query ambiguity is only problematic to an IR system when it is retrieving from very short queries [5]. Also word senses in a query has to resolve to high degree of accuracy. For CLIA, this creates much more complex problems as it involves understanding the query in multiple languages. To determine the sense of a word, a word sense disambiguation (WSD) algorithm typically should understand the context of the ambiguous word, external resources such as machine- readable dictionaries, or a combination of both. Although dictionaries provide useful word sense information and thesaurus provide additional information about relationships between words, they lack pragmatic information as unavailability of them in corpora. But there are major barriers in building a high- performing query disambiguation system as it includes difficulty of labeling data and predicting fine- grained sense distinctions. User queries can be diversified into different topics and areas and could mean differently in each of those areas. This creates problem of word disambiguation. Thus any system that attempts to retrieve good results has to determine the sense of a word from contextual features. V. END TO END TELUGU CLIA SYSTEM Our system focuses mainly on three languages: Telugu, Hindi and English. We analyze the performance of each language separately. In this section, we present an end to end separate CLIA system which is divided into different levels and modules. These levels consist of (a) Crawling, (b) Indexing and finally (C) Query processing, searching and ranking. We segregate each stage and explain the sub steps that were addressed at each level. A. Crawling A Crawler is a program, which browses the World Wide Web or the given document collection in a methodical automated manner and creates a copy of all the visited pages for later processing. In this CLIA system we have domain identifier [6] crawling system that looks for pages that fit a particular description. The pages belonging to certain selected topics, or satisfying certain criteria, are indexed and maintained. The crawler is guided by a classifier to recognize the relevance of the pages crawled. In this system we have specified domains, which crawls only Tourism and Health domains and also classifies the Indian languages. For this machine learning based SVM classifier is used. 5
7 B. Language Decoding As we are building system for Indian languages that have multiple encodings in each language needs to be addressed. Good number of language encodings is in usage on Internet and other document repositories. Especially in Indian language document repositories, these document collections are available in a set of national and international standards for Indian language character set encodings, while a number of publishers use proprietary non- standard encodings. For example Telugu language has some proprietary fonts like Hemalatah, Tikkana, Vaartha, Karthika etc.in order to be able to index such content, it is very important to identify the language and such encodings to achieve a better recall in retrieval by having much broader coverage. We use a set of heuristic based techniques using fonts and character ranges to recognize character encodings such as UTF- 8, ISCII or other proprietary encodings from HTML and PDF documents [7]. C. Indexing An indexer module should be capable of building data structures, which enable fast retrieval of relevant documents and also enable ranking of these documents. The design of data structure for index should also be sensitive to transaction aspects such as locking of indices for writes, updatability of index without difficulty etc. In a Cross- lingual IR task using a bilingual dictionary along with term co- occurrence statistics and language modeling approach [8] helps improve the functional IR performance. We used an augmented index model which is used for fast retrieval while having the benefits of language modeling in a CLIR task. This model is capable of retrieval and ranking with or without query expansion techniques using term collocation statistics of the indexed corpus [9]. D. Input Processing Query processing is important step for input processing in improving recall and precision of any CLIA system. Query pre- processing needs to be done to improve efficiency in retrieval. In this CLIA system query processing module uses n- gram based translation of query words using bi- lingual lexicons. Identification of named entities is done by Named Entity Recognizers. If translation is not possible transliteration of the identified named entities is been carried out. Also a Boolean query- scoring sub- module used to weigh Boolean queries to generate the output of the query- processing module. We briefly discuss some of the query processing steps performed in building this CLIA system for Telugu, Hindi and English. E. Query Processing List of stop words are prepared which is used to remove stop words from the query before translation or transliteration. Since Telugu is a highly agglutinative language and in some cases a whole phrase or a sentence from English could be translated into a single word in Telugu. For this we used some shallow rules [10] that can remove some common prefixes and suffixes. Some of the very common prefixes and suffixes are removed from a given Telugu 6
8 word. A given word is stemmed from these prefixes and suffixes recursively until a stem is obtained. Some of the other sub tasks in query pruning are listed below. Named Entities identification Named Entity Recognition is the task of identifying and classifying all proper nouns in a document as person names, organization names, location names, date & time expressions and miscellaneous. Conditional Random Fields (CRFs) [11] with two Indian languages Telugu and Hindi [12, 13] is used with the character n- gram based models for this task. Query translation using lexicons Resources containing English- Hindi and English- Telugu cross language dictionaries were primarily used in English to Indian language query translation. Available English- Hindi dictionary was conveniently formatted for machine processing; however the English- Telugu dictionary was a digitized version of a human readable dictionary. In order to convert the human readable dictionary to machine processing form, a set of regular expressions were used [7]. Transliteration Transliteration task is considered to be non- trivial as its output decides the quality of the results retrieved in CLIA systems. The basic principle followed was to compress the source word into its minimal form and align it across an indexed list of target language words to arrive at the top n- equivalents based on the modified edit distance. This Mapping model was used instead of alignment- based model as it ensures fastness. To speed up the process of arriving at the right match in the target index, Compressed Word Format (CWFs) of the target named entities and indexing the CWFs along with their actual forms is done. When any new query word arrives for transliteration CWF form is to be generated and is to be compared with the CWF of the source query word with the CWFs of the target language. Entries in the index of the matching entries with modified Levenshtein distance equal to zero will be returned. Query Scoring Query scoring helps to rank the retrieved results. Once the source language queries were translated and transliterated, the resultant target language keywords is to be used for constructing Boolean queries using the OR operator. If a particular keyword occurs in multiple sections of the query, it has to be given greater score compared to the other keywords accordingly. Hence, cumulative weight for each word to be calculated based on the number of occurrences [14]. Other important keywords like years, numbers, are also given higher weight factors. The translated query word in the Boolean query and the scoring weight are associated and the final query output for a given source language query from the system would be of the formed and used accordingly. F. Snippet Generation Snippet is a short summary of each document retrieved by the CLIA system for a user query. To retrieve the content of the retrieved document from search engine query words are given to the CLIA system. 7
9 In our system the query words will be highlighted in the snippet produced and then forwarded to the snippet Translation module. Extracted snippet is considered efficient if the sentence extraction has language specific sentence boundary delimiter task. For this title extracted using the title tags obtained from the HTML source file or from the document files is checked for accuracy. Each complete sentence will be checked for the presence of the query words in question. Sentences are ranked based on the number of query words, key words and domain ontology words present in the sentence. The generated short summary or snippet contains the context (five to ten words depending on the number of query words) of the highlighted query words subject to a predetermined length of the snippet. If no complete sentences are identified, the menu list on the web page will be listed as the snippet of the web page. G. Retrieval and Ranking Probabilistic retrieval algorithm BM25 [15] is used for cross lingual and monolingual information retrieval. It is a ranking function used to rank matching documents according to their relevance to a given search query. It uses bagof-words approach for retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document. It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. VI. EXPERIMENTAL RESULTS The evaluation of language identifier, Input processing and end- to- end CLIA system mentioned in earlier sections consists of analyzing and evaluating quantitatively to find the performance and issues in various components of the system. The evaluation framework will also dependent on the construction of benchmark data for the Telugu and Hindi languages in accordance with the requirements. Evaluation here depends on two major evaluations one is on methods and metrics and other is on creating benchmark data. Two of the most widely used measures of document retrieval effectiveness are Recall and Precision. Recall measures how well a system retrieves all the relevant documents; and Precision, how well the system retrieves only the relevant documents [16]. Precision scores have been evaluated for Language Identifier for each of the languages Hindi, Bengali, Marathi, Telugu, Punjabi and Tamil. Table 1 describes the module accuracy in identifying the languages. TABLE I. LANGUAGE IDENTIFIER ACCURACY Language Accuracy Telugu 98% Hindi 99% Tamil 99% Punjabi 97% Marati 98% Bengali 99% Input processing have sub tasks involved which needs to be evaluated separately. We will see the results obtained in the input processing task in end to end Telugu CLIA system. 8
10 End to End Telugu CLIA system has many sub tasks involved we will see evaluations done on each of them separately. Crawling Crawling was done to index only tourism domain related websites. Table 2 shows the accuracy achieved by the crawler in identifying the tourism related websites for Hindi and Telugu. TABLE II. DOMAIN IDENTIFIER Language Accuracy Telugu 83.5% Hindi 66% NER identification in corpus The corpus in total consisted of tokens out of which were NE s. Corpus is divided into training and testing sets. The training set consisted of tokens out of which 8485 were NE s. The testing set consisted of tokens out of which 2407 were NE s. A word window 6 words at a time having parts of speech (POS), suffix and prefix gave F- Measure 44.91% for Hindi. Another approach using n- grams has been performed which attained F- Measure of 44.57% for Hindi if n=4 and 49.62% for Telugu if n=3. Translation of Query Telugu to Hindi and Telugu to English Bilingual dictionaries has been developed which has 46k entries. Transliteration of Query Transliteration methods mentioned in the earlier sections using HMM and CRF models produced overall 87% precision when generated models are of Telugu to Hindi alignments and 83% for Telugu to English. Romanization method also tested which produced lower accuracies compared to HMM and CRF model. Overall system was evaluated using the P@5 and P@10 scores for 5 queries.table 3 below shows the scores for each of the queries. TABLE III. P@5 AND P@10 FOR 5 QUERIES FROM TELUGU-HINDI Query No. P@5 P@
11 TABLE IV. P@5 AND P@10 FOR 5 QUERIES FROM TELUGU-ENGLISH Query No. P@5 P@ VII. CONCLUSION AND FUTURE WORK We have described the CLIA systems pre processing and post processing tasks along with language identification and other specific tasks for building CLIA Telugu system. We understood that if the system handles information between pair of languages then they are called cross language systems. All cross language information access systems essentially benefit from the Information Retrieval technologies, which enable information access by retrieving a set of ranked documents that are likely to be relevant to the information. However, the role of IR technologies ends once the relevant documents are obtained. But there are other problems when dealing with CLIA systems. They require query translation/translation, NE identification, results ranking, which have dependencies on lexicons having good coverage, bilingual dictionaries, and parallel corpora for efficient information retrieval. In this paper, we have presented major challenges in building those CLIA systems and also we described end to end Telugu CLIA system. Also we will continuously working on the improvement on the existing system using various state of art approaches in input and output processing stages which constitutes our future work. ACKNOWLEDGMENT We acknowledge the funding we have received from Ministry of Communication and Information Technology (MCIT), Government of India and Cross Language Information Access (CLIA) consortium members helped us. REFERENCES [1] Oard, D., & Dorr, B. (1996). A Survey of Multilingual Text Retrieval. Technical Report UMIACS-TR-96-19, University of Maryland, Institute for Advanced Computer Studies. [2] Larkey, L., AbdulJaleel, N., & Connell, M. (2003).What s in a Name? : Proper Names in Arabic Cross Language Information Retrieval. CIIR Technical Report, IR-278, Univ. of Amherst. [3] Saravanan, K., Udupa, R., & Kumaran, A. (2010). Cross lingual Information Retrieval System Enhanced with Transliteration Generation and Mining. In Proceedings of Forum for Information Retrieval Evaluation (FIRE-2010) Workshop, Kolkata, India. 10
12 [4] Ea-Ee Jan, Shih-Hsiang Lin, & Berlin Chen. (2010). Transliteration Retrieval Model for Cross Lingual Information Retrieval. Lecture Notes in Computer Science, 2010, Volume 6458, Information Retrieval Technology, pp [5] Sanderson, M. (1994).Word sense disambiguation and information retrieval. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, Dublin, Ireland, p [6] Chakrabarti, S., Dom, B., & van den Berg, M. (1999). Focused Crawling: A New Approach for Topic-Specific Resource Discovery. In Proc. 8th World Wide Web Conf., Elsevier Science, Amsterdam, pp [7] Pingali, P., & Varma, V. (2006).Hindi and Telugu to English Cross Language Information Retrieval.Working Notes of Cross Language Evaluation Forum Workshop, Spain. [8] Pingali P., Kula K., T., & Varma, V. (2007). Hindi, Telugu, Oromo, English CLIR Evaluation. Evaluation of Multilingual and Multi-modal Information Retrieval. Vol. 4730, 2007, ISBN , Springer-Verlag. [9] Pingali, P., & Varma, V. (2007). Multilingual Indexing Support for CLIR using Language Modeling. In Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. [10] Pingali, P., Jagarlamudi, J., & Varma, V. (2006). A Dictionary Based Approach with Query Expansion to Cross Language Query Based Multi-Document Summarization: Experiments in Telugu English. National Workshop on Artificial Intelligence, Mumbai, India. [11] Wallach H., M. (2004). Conditional random fields: An introduction. Technical Report MS-CIS-04-21, University of Pennsylvania, Department of Computer and Information Science, University of Pennsylvania. [12] Shishtla, P., M., Pingali, P., & Varma, V. (2008). A Character n-gram Based Approach for Improved Recall in Indian Language NER. In Proceedings of IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages, Hyderabad, India, pp [13] Shishtla, P., M., Pingali, P., & Varma, V. (2008). Experiments in Telugu NER: A Conditional Random Field Approach. In Proceedings of IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages, Hyderabad, India, pp [14] Ballesteros, L., & Croft W., B. (1997). Phrasal translation and query expansion techniques for cross-language information retrieval. In Proceedings of the 20th Annual International ACM Conference on Research and Development in Information Retrieval, Philadelphia, PA, July 27 31, pp [15] Robertson, S., & Zaragoza, H. (2009).Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval. pp [16] Saracevic, T. (1975). Relevance: A review of and a framework for thinking on the notion in information science. Journal of the American Society for Information Science and Technology, pp
Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent
More informationMultilingual Information Retrieval
Proposal for Tutorial on Multilingual Information Retrieval Proposed by Arjun Atreya V Shehzaad Dhuliawala ShivaKarthik S Swapnil Chaudhari Under the direction of Prof. Pushpak Bhattacharyya Department
More informationResearch Article. August 2017
International Journals of Advanced Research in Computer Science and Software Engineering ISSN: 2277-128X (Volume-7, Issue-8) a Research Article August 2017 English-Marathi Cross Language Information Retrieval
More informationDeveloping Focused Crawlers for Genre Specific Search Engines
Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com
More informationDCU at FIRE 2013: Cross-Language!ndian News Story Search
DCU at FIRE 2013: Cross-Language!ndian News Story Search Piyush Arora, Jennifer Foster, and Gareth J. F. Jones CNGL Centre for Global Intelligent Content School of Computing, Dublin City University Glasnevin,
More informationA Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2
A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,
More informationAN UNSUPERVISED APPROACH TO DEVELOP IR SYSTEM: THE CASE OF URDU
AN UNSUPERVISED APPROACH TO DEVELOP IR SYSTEM: THE CASE OF URDU ABSTRACT Mohd. Shahid Husain Integral University, Lucknow Web Search Engines are best gifts to the mankind by Information and Communication
More informationChapter 2. Architecture of a Search Engine
Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them
More informationTHE WEB SEARCH ENGINE
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationThis paper studies methods to enhance cross-language retrieval of domain-specific
Keith A. Gatlin. Enhancing Cross-Language Retrieval of Comparable Corpora Through Thesaurus-Based Translation and Citation Indexing. A master s paper for the M.S. in I.S. degree. April, 2005. 23 pages.
More informationInternational Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine
International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains
More informationDomain-specific Concept-based Information Retrieval System
Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical
More information3 Publishing Technique
Publishing Tool 32 3 Publishing Technique As discussed in Chapter 2, annotations can be extracted from audio, text, and visual features. The extraction of text features from the audio layer is the approach
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationTEXT CHAPTER 5. W. Bruce Croft BACKGROUND
41 CHAPTER 5 TEXT W. Bruce Croft BACKGROUND Much of the information in digital library or digital information organization applications is in the form of text. Even when the application focuses on multimedia
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationAutomatic Bangla Corpus Creation
Automatic Bangla Corpus Creation Asif Iqbal Sarkar, Dewan Shahriar Hossain Pavel and Mumit Khan BRAC University, Dhaka, Bangladesh asif@bracuniversity.net, pavel@bracuniversity.net, mumit@bracuniversity.net
More informationIJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:
IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T
More information1.
* 390/0/2 : 389/07/20 : 2 25-8223 ( ) 2 25-823 ( ) ISC SCOPUS L ISA http://jist.irandoc.ac.ir 390 22-97 - :. aminnezarat@gmail.com mosavit@pnu.ac.ir : ( ).... 00.. : 390... " ". ( )...2 2. 3. 4 Google..
More informationCross Lingual Query Dependent Snippet Generation
Cross Lingual Query Dependent Snippet Generation Pinaki Bhaskar, Sivaji Bandyopadhyay Computer Science and Engineering Department, Jadavpur University Kolkata 700032, India Abstract The present paper describes
More informationMIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion
MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion Sara Lana-Serrano 1,3, Julio Villena-Román 2,3, José C. González-Cristóbal 1,3 1 Universidad Politécnica de Madrid 2 Universidad
More informationINTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & 6367(Print), ISSN 0976 6375(Online) Volume 3, Issue 1, January- June (2012), TECHNOLOGY (IJCET) IAEME ISSN 0976 6367(Print) ISSN 0976 6375(Online) Volume
More informationResPubliQA 2010
SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first
More informationCross-Language Information Retrieval using Dutch Query Translation
Cross-Language Information Retrieval using Dutch Query Translation Anne R. Diekema and Wen-Yuan Hsiao Syracuse University School of Information Studies 4-206 Ctr. for Science and Technology Syracuse, NY
More informationA Model for Information Retrieval Agent System Based on Keywords Distribution
A Model for Information Retrieval Agent System Based on Keywords Distribution Jae-Woo LEE Dept of Computer Science, Kyungbok College, 3, Sinpyeong-ri, Pocheon-si, 487-77, Gyeonggi-do, Korea It2c@koreaackr
More informationCross lingual Information Retrieval
Cross lingual Information Retrieval Chapter 1. CLIR and its challenges A large amount of information in the form of text, audio, video and other documents is available on the web. Users should be able
More informationWhat is this Song About?: Identification of Keywords in Bollywood Lyrics
What is this Song About?: Identification of Keywords in Bollywood Lyrics by Drushti Apoorva G, Kritik Mathur, Priyansh Agrawal, Radhika Mamidi in 19th International Conference on Computational Linguistics
More informationPart I: Data Mining Foundations
Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?
More informationOntology-Based Web Query Classification for Research Paper Searching
Ontology-Based Web Query Classification for Research Paper Searching MyoMyo ThanNaing University of Technology(Yatanarpon Cyber City) Mandalay,Myanmar Abstract- In web search engines, the retrieval of
More informationCross Lingual Information Retrieval Using Data Mining Methods
Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2009 Proceedings Americas Conference on Information Systems (AMCIS) 2009 Cross Lingual Information Retrieval Using Data Mining Methods
More informationA Practical Passage-based Approach for Chinese Document Retrieval
A Practical Passage-based Approach for Chinese Document Retrieval Szu-Yuan Chi 1, Chung-Li Hsiao 1, Lee-Feng Chien 1,2 1. Department of Information Management, National Taiwan University 2. Institute of
More informationWeb Information Retrieval using WordNet
Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT
More informationSearch Engines Information Retrieval in Practice
Search Engines Information Retrieval in Practice W. BRUCE CROFT University of Massachusetts, Amherst DONALD METZLER Yahoo! Research TREVOR STROHMAN Google Inc. ----- PEARSON Boston Columbus Indianapolis
More informationHandwritten Script Recognition at Block Level
Chapter 4 Handwritten Script Recognition at Block Level -------------------------------------------------------------------------------------------------------------------------- Optical character recognition
More informationData and Information Integration: Information Extraction
International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Data and Information Integration: Information Extraction Varnica Verma 1 1 (Department of Computer Science Engineering, Guru Nanak
More informationFocused crawling: a new approach to topic-specific Web resource discovery. Authors
Focused crawling: a new approach to topic-specific Web resource discovery Authors Soumen Chakrabarti Martin van den Berg Byron Dom Presented By: Mohamed Ali Soliman m2ali@cs.uwaterloo.ca Outline Why Focused
More informationLearning to find transliteration on the Web
Learning to find transliteration on the Web Chien-Cheng Wu Department of Computer Science National Tsing Hua University 101 Kuang Fu Road, Hsin chu, Taiwan d9283228@cs.nthu.edu.tw Jason S. Chang Department
More informationCIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets
CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,
More informationDon't Use a Lot When Little Will Do : Genre Identi cation Using URLs
Don't Use a Lot When Little Will Do : Genre Identi cation Using URLs by Nikhil Priyatam, Srinivasan Iyengar, Krish Perumal, Vasudeva Varma in 14th International Conference on Intelligent Text Processing
More informationDesigning and Building an Automatic Information Retrieval System for Handling the Arabic Data
American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More informationAn Approach To Web Content Mining
An Approach To Web Content Mining Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol Department of Computer Engg. Datta Meghe College of Engineering, Airoli, Navi Mumbai Abstract-With the research
More informationText Mining: A Burgeoning technology for knowledge extraction
Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.
More informationSense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm
ISBN 978-93-84468-0-0 Proceedings of 015 International Conference on Future Computational Technologies (ICFCT'015 Singapore, March 9-30, 015, pp. 197-03 Sense-based Information Retrieval System by using
More informationAutomatic Text Summarization System Using Extraction Based Technique
Automatic Text Summarization System Using Extraction Based Technique 1 Priyanka Gonnade, 2 Disha Gupta 1,2 Assistant Professor 1 Department of Computer Science and Engineering, 2 Department of Computer
More informationTERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES
TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.
More informationContext Based Indexing in Search Engines: A Review
International Journal of Computer (IJC) ISSN 2307-4523 (Print & Online) Global Society of Scientific Research and Researchers http://ijcjournal.org/ Context Based Indexing in Search Engines: A Review Suraksha
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly
More informationContext Based Web Indexing For Semantic Web
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 12, Issue 4 (Jul. - Aug. 2013), PP 89-93 Anchal Jain 1 Nidhi Tyagi 2 Lecturer(JPIEAS) Asst. Professor(SHOBHIT
More informationKnowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.
Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European
More informationFull Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd.
Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd. This paper was presented at the ICADL conference December 2001 by Spectrum
More informationEmpirical Analysis of Single and Multi Document Summarization using Clustering Algorithms
Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2562-2567 2562 Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Mrunal S. Bewoor Department
More informationHandling Place References in Text
Handling Place References in Text Introduction Most (geographic) information is available in the form of textual documents Place reference resolution involves two-subtasks: Recognition : Delimiting occurrences
More informationFinding Topic-centric Identified Experts based on Full Text Analysis
Finding Topic-centric Identified Experts based on Full Text Analysis Hanmin Jung, Mikyoung Lee, In-Su Kang, Seung-Woo Lee, Won-Kyung Sung Information Service Research Lab., KISTI, Korea jhm@kisti.re.kr
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationNUS-I2R: Learning a Combined System for Entity Linking
NUS-I2R: Learning a Combined System for Entity Linking Wei Zhang Yan Chuan Sim Jian Su Chew Lim Tan School of Computing National University of Singapore {z-wei, tancl} @comp.nus.edu.sg Institute for Infocomm
More informationComment Extraction from Blog Posts and Its Applications to Opinion Mining
Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan
More informationA Hybrid Unsupervised Web Data Extraction using Trinity and NLP
IJIRST International Journal for Innovative Research in Science & Technology Volume 2 Issue 02 July 2015 ISSN (online): 2349-6010 A Hybrid Unsupervised Web Data Extraction using Trinity and NLP Anju R
More informationBing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer
Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web
More informationA Review on Identifying the Main Content From Web Pages
A Review on Identifying the Main Content From Web Pages Madhura R. Kaddu 1, Dr. R. B. Kulkarni 2 1, 2 Department of Computer Scienece and Engineering, Walchand Institute of Technology, Solapur University,
More informationInformation Retrieval Spring Web retrieval
Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic
More informationBasic techniques. Text processing; term weighting; vector space model; inverted index; Web Search
Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing
More informationCLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval
DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,
More informationA Document Graph Based Query Focused Multi- Document Summarizer
A Document Graph Based Query Focused Multi- Document Summarizer By Sibabrata Paladhi and Dr. Sivaji Bandyopadhyay Department of Computer Science and Engineering Jadavpur University Jadavpur, Kolkata India
More informationA User Study on Features Supporting Subjective Relevance for Information Retrieval Interfaces
A user study on features supporting subjective relevance for information retrieval interfaces Lee, S.S., Theng, Y.L, Goh, H.L.D., & Foo, S. (2006). Proc. 9th International Conference of Asian Digital Libraries
More informationNoida institute of engineering and technology,greater noida
Impact Of Word Sense Ambiguity For English Language In Web IR Prachi Gupta 1, Dr.AnuragAwasthi 2, RiteshRastogi 3 1,2,3 Department of computer Science and engineering, Noida institute of engineering and
More informationCross-Lingual Information Access and Its Evaluation
Cross-Lingual Information Access and Its Evaluation Noriko Kando Research and Development Department National Center for Science Information Systems (NACSIS), Japan URL: http://www.rd.nacsis.ac.jp/~{ntcadm,kando}/
More informationAN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY
Asian Journal Of Computer Science And Information Technology 2: 3 (2012) 26 30. Contents lists available at www.innovativejournal.in Asian Journal of Computer Science and Information Technology Journal
More informationInformation Retrieval. Information Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent
More informationNOVEL IMPLEMENTATION OF SEARCH ENGINE FOR TELUGU DOCUMENTS WITH SYLLABLE N- GRAM MODEL
NOVEL IMPLEMENTATION OF SEARCH ENGINE FOR TELUGU DOCUMENTS WITH SYLLABLE N- GRAM MODEL DR.B.PADMAJA RANI* AND DR.A.VINAY BABU 1 *Associate Professor Department of CSE JNTUCEH Hyderabad A.P. India http://jntuceh.ac.in/csstaff.htm
More informationCLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL
STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper
More informationImprovement of Web Search Results using Genetic Algorithm on Word Sense Disambiguation
Volume 3, No.5, May 24 International Journal of Advances in Computer Science and Technology Pooja Bassin et al., International Journal of Advances in Computer Science and Technology, 3(5), May 24, 33-336
More informationInformation Retrieval May 15. Web retrieval
Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically
More informationRLAT Rapid Language Adaptation Toolkit
RLAT Rapid Language Adaptation Toolkit Tim Schlippe May 15, 2012 RLAT Rapid Language Adaptation Toolkit - 2 RLAT Rapid Language Adaptation Toolkit RLAT Rapid Language Adaptation Toolkit - 3 Outline Introduction
More informationIJRIM Volume 2, Issue 2 (February 2012) (ISSN )
AN ENHANCED APPROACH TO OPTIMIZE WEB SEARCH BASED ON PROVENANCE USING FUZZY EQUIVALENCE RELATION BY LEMMATIZATION Divya* Tanvi Gupta* ABSTRACT In this paper, the focus is on one of the pre-processing technique
More informationWord Disambiguation in Web Search
Word Disambiguation in Web Search Rekha Jain Computer Science, Banasthali University, Rajasthan, India Email: rekha_leo2003@rediffmail.com G.N. Purohit Computer Science, Banasthali University, Rajasthan,
More informationDiscriminative Training with Perceptron Algorithm for POS Tagging Task
Discriminative Training with Perceptron Algorithm for POS Tagging Task Mahsa Yarmohammadi Center for Spoken Language Understanding Oregon Health & Science University Portland, Oregon yarmoham@ohsu.edu
More informationA Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana
School of Information Technology A Frequent Max Substring Technique for Thai Text Indexing Todsanai Chumwatana This thesis is presented for the Degree of Doctor of Philosophy of Murdoch University May
More informationRecognition of online captured, handwritten Tamil words on Android
Recognition of online captured, handwritten Tamil words on Android A G Ramakrishnan and Bhargava Urala K Medical Intelligence and Language Engineering (MILE) Laboratory, Dept. of Electrical Engineering,
More informationUsing Monolingual Clickthrough Data to Build Cross-lingual Search Systems
Using Monolingual Clickthrough Data to Build Cross-lingual Search Systems ABSTRACT Vamshi Ambati Institute for Software Research International Carnegie Mellon University Pittsburgh, PA vamshi@cmu.edu A
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationDomain Based Named Entity Recognition using Naive Bayes
AUSTRALIAN JOURNAL OF BASIC AND APPLIED SCIENCES ISSN:1991-8178 EISSN: 2309-8414 Journal home page: www.ajbasweb.com Domain Based Named Entity Recognition using Naive Bayes Classification G.S. Mahalakshmi,
More informationImproving Synoptic Querying for Source Retrieval
Improving Synoptic Querying for Source Retrieval Notebook for PAN at CLEF 2015 Šimon Suchomel and Michal Brandejs Faculty of Informatics, Masaryk University {suchomel,brandejs}@fi.muni.cz Abstract Source
More informationAutomatic New Topic Identification in Search Engine Transaction Log Using Goal Programming
Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Automatic New Topic Identification in Search Engine Transaction Log
More informationEvaluating the Usefulness of Sentiment Information for Focused Crawlers
Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,
More informationCADIAL Search Engine at INEX
CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr
More informationOverview of FIRE 2011 Prasenjit Majumder on behalf of the FIRE team
Overview of FIRE 2011 Prasenjit Majumder on behalf of the FIRE team Overview of FIRE 2011 p. 1/21 Overview Background Tasks Data Results Problems and prospects People Overview of FIRE 2011 p. 2/21 Background
More informationCompetitive Intelligence and Web Mining:
Competitive Intelligence and Web Mining: Domain Specific Web Spiders American University in Cairo (AUC) CSCE 590: Seminar1 Report Dr. Ahmed Rafea 2 P age Khalid Magdy Salama 3 P age Table of Contents Introduction
More informationMURDOCH RESEARCH REPOSITORY
MURDOCH RESEARCH REPOSITORY http://researchrepository.murdoch.edu.au/ This is the author s final version of the work, as accepted for publication following peer review but without the publisher s layout
More informationSearching and Organizing Images Across Languages
Searching and Organizing Images Across Languages Paul Clough University of Sheffield Western Bank Sheffield, UK +44 114 222 2664 p.d.clough@sheffield.ac.uk Mark Sanderson University of Sheffield Western
More informationExtracting Summary from Documents Using K-Mean Clustering Algorithm
Extracting Summary from Documents Using K-Mean Clustering Algorithm Manjula.K.S 1, Sarvar Begum 2, D. Venkata Swetha Ramana 3 Student, CSE, RYMEC, Bellary, India 1 Student, CSE, RYMEC, Bellary, India 2
More informationWeb Data Extraction and Alignment Tools: A Survey Pranali Nikam 1 Yogita Gote 2 Vidhya Ghogare 3 Jyothi Rapalli 4
IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 01, 2015 ISSN (online): 2321-0613 Web Data Extraction and Alignment Tools: A Survey Pranali Nikam 1 Yogita Gote 2 Vidhya
More informationTilburg University. Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval
Tilburg University Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval Publication date: 2006 Link to publication Citation for published
More informationCHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING
94 CHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING 5.1 INTRODUCTION Expert locator addresses the task of identifying the right person with the appropriate skills and knowledge. In large organizations, it
More informationEffect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching
Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna
More informationEffective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization
Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Romain Deveaud 1 and Florian Boudin 2 1 LIA - University of Avignon romain.deveaud@univ-avignon.fr
More informationLog Linear Model for String Transformation Using Large Data Sets
Log Linear Model for String Transformation Using Large Data Sets Mr.G.Lenin 1, Ms.B.Vanitha 2, Mrs.C.K.Vijayalakshmi 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology,
More informationA hybrid method to categorize HTML documents
Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper
More informationInformation Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured
More information