HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL

Similar documents
Ontology-Based Web Query Classification for Research Paper Searching

A Metric for Inferring User Search Goals in Search Engines

Automatic Search Engine Evaluation with Click-through Data Analysis. Yiqun Liu State Key Lab of Intelligent Tech. & Sys Jun.

Keyword Extraction by KNN considering Similarity among Features

Northeastern University in TREC 2009 Million Query Track

Encoding Words into String Vectors for Word Categorization

CS Search Engine Technology

A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING

Inferring User Search for Feedback Sessions

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach

A New Technique to Optimize User s Browsing Session using Data Mining

Review: Searching the Web [Arasu 2001]

A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS

Multimodal Information Spaces for Content-based Image Retrieval

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Information Retrieval Issues on the World Wide Web

Search Engine Architecture. Hongning Wang

Information Retrieval and Web Search

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM

An Approach to Manage and Search for Software Components *

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Link Analysis in Web Information Retrieval

Automatic Query Type Identification Based on Click Through Information

second_language research_teaching sla vivian_cook language_department idl

A Model for Interactive Web Information Retrieval

Domain Specific Search Engine for Students

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

An Improvement of Search Results Access by Designing a Search Engine Result Page with a Clustering Technique

Distributed Processing of Conjunctive Queries

Exploiting Global Impact Ordering for Higher Throughput in Selective Search

Personalizing PageRank Based on Domain Profiles

Oleksandr Kuzomin, Bohdan Tkachenko

CS290N Summary Tao Yang

Naïve Bayes for text classification

String Vector based KNN for Text Categorization

Automatic Identification of User Goals in Web Search [WWW 05]

Kohei Arai 1 Graduate School of Science and Engineering Saga University Saga City, Japan

SE4SC: A Specific Search Engine for Software Components *

Query Disambiguation from Web Search Logs

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Support System- Pioneering approach for Web Data Mining

Automated Online News Classification with Personalization

Ranking models in Information Retrieval: A Survey

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

Ternary Tree Tree Optimalization for n-gram for n-gram Indexing

A Framework for Incremental Hidden Web Crawler

Use of Temporal Expressions in Web Search

STUDYING OF CLASSIFYING CHINESE SMS MESSAGES

Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm

Searching the Web for Information

A Survey on k-means Clustering Algorithm Using Different Ranking Methods in Data Mining

Information Retrieval. hussein suleman uct cs

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES


[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

60-538: Information Retrieval

Recent Researches on Web Page Ranking

Web Structure Mining using Link Analysis Algorithms

Information Gathering Support Interface by the Overview Presentation of Web Search Results

Detecting Spam with Artificial Neural Networks

International Journal of Research in Computer and Communication Technology, Vol 3, Issue 11, November

Visualization of User Eye Movements for Search Result Pages

AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Using Hyperlink Features to Personalize Web Search

Searching the Web [Arasu 01]

PERSONALIZED MOBILE SEARCH ENGINE BASED ON MULTIPLE PREFERENCE, USER PROFILE AND ANDROID PLATFORM

Supervised Ranking for Plagiarism Source Retrieval

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

REDUNDANCY REMOVAL IN WEB SEARCH RESULTS USING RECURSIVE DUPLICATION CHECK ALGORITHM. Pudukkottai, Tamil Nadu, India

Analysis on the technology improvement of the library network information retrieval efficiency

Semantic Preference Retrieval for Querying Knowledge Bases

Relevancy Measurement of Retrieved Webpages Using Ruzicka Similarity Measure

Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines

Collaborative Filtering using Euclidean Distance in Recommendation Engine

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS

An Application of Personalized PageRank Vectors: Personalized Search Engine

Query Modifications Patterns During Web Searching

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Tag Based Image Search by Social Re-ranking

Keywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts.

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

Introduction to Information Retrieval

AN ENCODING TECHNIQUE BASED ON WORD IMPORTANCE FOR THE CLUSTERING OF WEB DOCUMENTS

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

Keywords Data alignment, Data annotation, Web database, Search Result Record

Optimizing Search Engines using Click-through Data

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Linking Entities in Chinese Queries to Knowledge Graph

An Efficient Methodology for Image Rich Information Retrieval

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

Correlation Based Feature Selection with Irrelevant Feature Removal

Recommendation Algorithms: Collaborative Filtering. CSE 6111 Presentation Advanced Algorithms Fall Presented by: Farzana Yasmeen

A Few Things to Know about Machine Learning for Web Search

Transcription:

International Journal of Mechanical Engineering & Computer Sciences, Vol.1, Issue 1, Jan-Jun, 2017, pp 12-17 HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL BOMA P. W 1, OGHENEVOR E. E 2, ALABI O. A 3 1,2,3 DEPARTMEN T OF COMPUTER SCIENCE UNIVERSITY OF PORT HARCOURT, NIGERIA www.arseam.com ABSTRACT Information Retrieval (IR) has become a topic of great interest with the advent of text search engines on the Internet. Retrieving information from large databases can sometimes be very difficult and may tend to be very slow. This is especially so when there is need to manage these data or documents. This is because databases contain millions of documents in myriad subject areas. This is why information retrieval (IR) is very important. An IR systems matches user queries to documents stored in a database. Despite all these efforts, retrieving documents and texts is still problematic as none has been fully efficient in terms of speed. In this research work, we combined two known techniques of information retrieval: K-nearest neighbor (KNN) and Support Vector Machine (SVM) to mine and retrieve information more efficiently. We are going to compare our results with other results and see how efficiently our model is over the other models. Keywords: Information Retrieval, guery. Introduction: Retrieval of text-based information also termed Information Retrieval (IR) has become a topic of great interest with the advent of text search engines on the Internet (Baeza-Yates and Ribeiro- Neto, 1999). Traditionally in Information Retrieval (IR), documents and text queries both are represented in a unified manner, as sets of terms, to compute the distances between queries and documents thus providing a framework to directly implement simple text retrieval algorithms (Beitzel et al., 2007). submit paper : editor@arseam.com download full paper : www.arseam.com 12

Boma P. et. al / Hybridized Model for Efficient Matching and Data Prediction in Information Retrieval Materials and methods Retrieving information from the Internet and from large database is quite difficult and time consuming especially if such information is unstructured. A lot of algorithms and techniques have been developed in the area of data mining and information retrieval yet retrieving data from large databases continue to be problematic. In this research work, we combined two well-known techniques of information retrieval with the hope that our model could capitalize on the advantages of the two techniques and produce a better and more efficient technique or model for information retrieval. Geng et al. (2008) proposed K-nearest neighbor (KNN) for retrieving information by ranking the pages. The work employs KNN to rank different queries and conducted query-dependent ranking. An online method which created a ranking model for a given feature space and then rank the documents with respect to the query using the model. Then they used two offline approximation of the method to create the ranking models in advance to enhance the efficiency of ranking. They further prove that the approximations are accurate in terms of difference in loss of prediction if the learning algorithm used is stable with respect to minor changes in training examples. In retrieving information using K-NN technique, the documents are classified based on their similarities. K-NN is then used to retrieve the most similar k documents for each query document. This is done by ranking the candidate categories by vote using the weighted average and then using support vector machine (SVM) to predict the categories of the document as the most voted categories. This is done by finding a set of keywords that have similarity in meanings for the k-nearest neighbor. Queries with similar meanings and contain tags are then searched using closely related keywords. An algorithm is then constructed for the retrieval of documents that are sent to the database based on the queries. As soon as a neighborhood of similar to the words in the search space is established, the algorithm considers on tagged queries or tagged keywords that are used in the query and then assign weight to them by calculating the weight for each tag which we refer to as w t (i.e., weight for each tag). The average weight of these similar keywords is the calculated and the query tag is applied. submit paper : editor@arseam.com download full paper : www.arseam.com 13

International Journal of Mechanical Engineering & Computer Sciences, Vol.1, Issue 1, Jan-Jun, 2017, pp 12-17 RESULTS User Interface Text User Need Text Operations Logical View User feedback Query Operations Indexing Database Manager Inverted File Query Neural KNN Networ k Searching SVM Index Database Remarked Documents Remarking Retrieved Documents Fig. 1: Architecture of the proposed model Document from. It has three parts: the database, the data mining application, and the front end which is also the interface where a user can submit his query and get the result of the retrieved information. It also contains the program manager and the metadata layers as shown in figure 3.2. The metadata contains the input data, the staging area, the result of the retrieved information, and the metadata. submit paper : editor@arseam.com download full paper : www.arseam.com 14

Boma P. et. al / Hybridized Model for Efficient Matching and Data Prediction in Information Retrieval Fig. 2: Metadata layers The data collected are shown in table 1. Folksonomy Citeulike Delicious Users 2,051 18,105 Keywords 6,245 52,317 Tags 3,343 13,053 Posts 42,277 2,309,427 Annotations 105,853 8,815,545 submit paper : editor@arseam.com download full paper : www.arseam.com 15

International Journal of Mechanical Engineering & Computer Sciences, Vol.1, Issue 1, Jan-Jun, 2017, pp 12-17 10,000,000 9,000,000 8,000,000 7,000,000 6,000,000 5,000,000 4,000,000 3,000,000 2,000,000 1,000,000 0 Keywords Tags Posts Annotatio n Citeulike 5,233 6,245 3,343 42,277 105,853 Delicious 21, 415 52,317 13,053 2,309,429 8,815,545 Figure 2: From the result, we can see that the number of users in Citeulike are 2,051, keywords is 6,245, tags is 3,343, posts is 42,277, annotations is 105,353 while for delicious, the number of users is 18,105, keywords is 52,317, tags is 13,053, posts is 2,309,429, and annotations is 8,815,545. The result shows that there are more users of Delicious and that the keywords and tags are very important in document searching and information retrieval. Discussion When we compared this hybridized technique with the techniques used that used KNN and Space vector machine separately, it was found that the performance of this approach is better than when the tools are used separately. That is, our technique is able to retrieve information faster with significant lesser time. REFERENCES Anagnostopoulos, A., Broder, A., and Carmel, D. (2005), Sampling Search Engine Results, ACM Journal of Computer Networks, pp. 245 256. Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A. and Raghavan, S. (2001), Searching the Web, ACM Transaction on Internet technology, Vol. 1, No. 1, pp. 2 43. submit paper : editor@arseam.com download full paper : www.arseam.com 16

Boma P. et. al / Hybridized Model for Efficient Matching and Data Prediction in Information Retrieval Baeza-Yates, R., and Ribeiro-Neto, B. (1999), Modern Information Retrieval, ACM Press. Barlow, L. (2004). How to Use Web Search Engines, Tips on Using Search Site Like Google, alltheweb and Yahoo. Brandt, R. (2009). Starting Up, How Google Got Its Groove, Stanford Magazine. Beitzel, S. M. Jensen, E. C. Chowdhury, A. and Frieder, O. (2007). Varying Approaches to Topical Web Query Classification. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 783 784, New York, NY, USA, 2007. Brin, S. and Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Se\arch Engine, WWW7/Computer Networks and ISDN Systems, Vol. 30, pp. 107 117. Broder, A. Z. (2002). A Taxonomy of Web Search, SIGIR Forum, Vol. 36, No. 2, pp. 3 10. Geng, X., Liu, T. Y., Qin, T., Arnold, A., Li, H. and Shum, H.-Y. (2008). Query Dependent Ranking Using K-Nearest Neighbor, SIGIR 08, July 20-24, 2008, Singapore. submit paper : editor@arseam.com download full paper : www.arseam.com 17