IJITE Vol.04 Issue-04, (April, 2016) ISSN: 2321-1776, Impact Factor- 5.343
http://www.ijmr.net.in email id- irjmss@gmail.com

Hybrid Approach For Efficient Diversification on Cloud Stored Large Dataset

Geetanjali Mohite 1, Prof. Gauri Rao 2
1 Student, Department of Computer Engineering, B.V.D.U.C.O.E, Pune, Maharashtra, India
2 Associate Professor, Department of Computer Engineering, B.V.D.U.C.O.E, Pune, Maharashtra, India

Abstract

With an enormous amount of data being stored and searched on the web, quick and efficient search of relevant data has become a challenge. Recently, diversification has been proposed as an approach that lets users search a large data set without having to look through all relevant results. Among the different diversification approaches, Gossip Based Recommendation has become popular because it provides scalability, automaticity, dynamicity and decentralized control. Here each user has the flexibility to construct a cluster of relevant datasets, which is then used in the processing of queries. Because of this user involvement, the chance of redundancy amongst users increases considerably. In this paper we introduce the usefulness score, which is an indication of data relevance. This usefulness score is then merged with the distance-based access method of Space Partitioning and Probing to further minimize the number of accessed objects. The objective is to improve performance by accessing only a small number of objects while still guaranteeing result quality in terms of relevance and diversification.

Key Words: Diversification, Ranking, Usefulness Score.

Introduction

Keyword-based queries are predominantly used to retrieve data because of their ease of use. But while keyword queries empower ordinary users to search vast amounts of data, their ambiguity makes them difficult to answer effectively, especially when they are short and vague. To address this problem, diversification was introduced for querying large data sets, so that the number of objects accessed while retrieving the relevant data can be minimized.
Researchers have proposed many approaches to diversification, of which the Gossip Based approach and Space Partitioning and Probing (SPP) are considered among the better ones. Each has its own benefits: the Gossip Based approach is more focused on relevant data, whereas SPP is more focused on optimized speed. In this paper we propose a hybrid approach that uses the usefulness score as the ranking system and merges it with the Space Partitioning and Probing mechanism to reduce the number of accessed objects, which further reduces the computational overhead and thus moves towards more relevant, optimized search results.

To demonstrate our approach we use a real estate dataset, because location-mapped data is nowadays increasingly searched on the web. Many location-based services have been launched in which users can upload or create content mapped to places. Real estate directories and websites are a good example, since every item provided or searched by a user is location specific. This motivated us to use a Real Estate dataset to demonstrate diversification in the proposed Hybrid Approach.

Existing System

In today's world users generate huge amounts of diverse data. When a user wants to share data addressed to particular user groups, the data has to be shared through many different applications, and a user who wants to search for the data has to search many different systems, because most of the systems do not communicate amongst themselves. Many systems and approaches have been proposed to address this issue, which is a major hindrance to users adopting the distributed environment for their data sharing, mainly because of the lack of trust in data security and the absence of assurance from service providers that users looking for the data will always get the relevant information. Facebook is one example that tried to mitigate this issue by allowing several accounts and data from different systems to be grouped (e.g. Facebook makes it possible to regroup DropBox and blogs into a single Facebook account). However, such solutions are limited to a few well-known systems. Another existing approach in the context of large-scale distribution of users and data is data sharing by distributed search and recommendation.

Proposed System

[Flowchart of the proposed system. Node labels: Start; Search parameter definition; User search criteria (input); usefulness score > 0.5; Exclude data from search dataset; Retrieve results; Space partitioning for cluster formation; Compare user search criteria and cluster; Display results; Stop]

On the existing Web there are numerous systems in which users from diverse locations and with diverse interests can share data, and users have become heavily dependent on the Web for relevant information search. The introduction of the cloud has added more distribution and diversification to the datasets that search engines have to navigate to extract relevant data for a user's search query. Researchers have developed many methodologies to handle this diversification issue in order to deliver more accurate information and enhance user satisfaction. Our observation, however, is that the more effort has been put into enhancing accuracy, the more complexity and computational overhead has resulted, leading to inefficient systems. This gap motivated us to design a system that enhances efficiency without compromising the accuracy of search results over a distributed, diversified dataset, thus enhancing user satisfaction. In this paper we propose a hybrid search and recommendation technique to devise a system that is both efficient and accurate in its results.

In the proposed search and recommendation approach, when a user U submits a query Q, the system sends Q to a subset of users that is actually a cluster of users, say users belonging to specific locations. The subset has usefulness scores against some predetermined parameters, say property location and cost. The engine therefore has fewer data objects to search, because the users are clustered and only the data objects belonging to that cluster are searched. The usefulness score provided by users who have searched for a similar query Q against the parameters defines the relevance of the search result. This approach aims to deliver accurate data efficiently; in addition, the relevance score of the system increases incrementally with system usage.

System Architecture

Figure 1: Architecture. [Diagram labels: Users; Search dataset; usefulness score; Search Engine with SPP and Diversification parameter; Dataset; Diversification parameter; Assign U-set; Search string; User; Search results]

Contribution

To demonstrate the efficiency of our approach we consider the Real Estate dataset, where the diversification parameters under consideration are Cost, Area, Location, Purpose, Population and Property type. The successful implementation of our approach relies on the steps below.

Let Q be the set of all possible queries (all the combinations of terms), and consider the probability that a user v can return at least one relevant item given a random query q out of Q. In the following, we first define the coverage with respect to the user set (U-Set). Then, based on coverage, we express the usefulness of a user v with respect to the other users in the user set. The gossip-based recommendation approach requires a set of registered users, the U-Set. The user profiles should be such that the coverage probability is maximized.
Thus a strategy for maximizing the coverage probability will be devised. The usefulness score is defined as follows: given a user u from the U-Set, the usefulness of a user profile v is the probability that v can return relevant items for a random query q that could not be returned by the other users in u's U-Net. The usefulness score should also consider relevance. The usefulness score will be provided by the user.
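The coverage and usefulness definitions above can be illustrated with a small estimate over a finite sample of queries. The sketch below is only an illustration under stated assumptions, not the authors' implementation: profiles are modelled as sets of item ids, `relevant_items` is a hypothetical stand-in for whatever relevance test the system applies, and the probabilities are approximated by frequencies over the sampled queries.

```python
def coverage(profiles, queries, relevant_items):
    """Fraction of sampled queries for which at least one profile holds a relevant item."""
    pool = set()
    for profile in profiles:
        pool |= profile
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if relevant_items(q) & pool)
    return hits / len(queries)


def usefulness(v, others, queries, relevant_items):
    """Fraction of sampled queries for which profile v returns a relevant item
    that none of the profiles in `others` can return."""
    covered = set()
    for profile in others:
        covered |= profile
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if (relevant_items(q) & v) - covered)
    return hits / len(queries)


# Hypothetical toy data: query -> ids of the items relevant to it.
relevance = {
    "pune flat": {"a", "b"},
    "mumbai villa": {"c"},
    "pune plot": {"d"},
}
queries = list(relevance)
u_net = [{"a"}, {"c"}]          # profiles already in u's U-Net
candidate = {"b", "d"}          # profile v being scored

score = usefulness(candidate, u_net, queries, lambda q: relevance.get(q, set()))
print(round(score, 3))
```

In this toy setting the candidate adds something new for two of the three queries, so it would pass a 0.5 cut-off such as the one shown in the flowchart, whereas a profile that only repeats item "a" would score zero.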

Next, the useful U-Set clustering takes place, for which we use the Space Partitioning and Probing mechanism, in which bounded diversification with sorted access methods was introduced and formally defined (Fraternali et al., 2012). The Pull/Bound Maximum Marginal Relevance (PBMMR) family of algorithms is used, which exploits spatial probing locations and the adaptive alternation of usefulness-score-based and distance-based access to reduce the number of fetched objects. Space Partitioning and Probing (SPP) is an instance of PBMMR whose pulling strategy uses a tight upper bound. SPP is shown to attain the same diversification quality and exactly the same output as MMR, the most popular result diversification algorithm, while accessing only a fraction of the objects.

Algorithm

To achieve the desired result we have devised a hybrid algorithm based on the gossip-based recommendation and Space Partitioning and Probing (SPP) algorithms. The gossip-based recommendation algorithm primarily focuses on the relevance of the data, whereas SPP focuses on the efficiency of the process.
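As background for the hybrid algorithm, the MMR baseline that SPP is said to match can be sketched as a greedy selection. This is a minimal illustration of a generic, distance-based form of MMR, not of SPP's bounded access and not of the authors' code. The trade-off weight `lam`, the normalized `(cost, area)` coordinates, the Euclidean distance and the sample listings are all assumptions made for the example, and the usefulness score stands in for relevance.

```python
import math


def mmr_select(objects, k, lam=0.5):
    """Greedy MMR: repeatedly pick the object that maximizes
    lam * score + (1 - lam) * distance to the closest already-selected object."""
    selected = []
    candidates = list(objects)
    while candidates and len(selected) < k:

        def marginal(obj):
            if not selected:
                return lam * obj["score"]
            nearest = min(math.dist(obj["pos"], s["pos"]) for s in selected)
            return lam * obj["score"] + (1 - lam) * nearest

        best = max(candidates, key=marginal)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical listings: usefulness score plus normalized (cost, area) coordinates.
listings = [
    {"id": "A", "score": 0.90, "pos": (0.10, 0.10)},
    {"id": "B", "score": 0.85, "pos": (0.12, 0.10)},
    {"id": "C", "score": 0.60, "pos": (0.90, 0.80)},
]

print([o["id"] for o in mmr_select(listings, k=2)])
```

With `lam=0.5` the second pick is the distant listing C rather than the near-duplicate B, which is the diversification effect the text describes. Note that this plain version scans every candidate in each round; reducing that access cost is what the bounded PBMMR/SPP access is meant to achieve.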
The algorithm is defined below.

Algorithm 1: Useful U-Net
Input: u profile, U-Net_u (array[1..n]), RandomView_u
Output: U-Net_u is updated with respect to the Random View
parameter: unsorted list of user search criteria

    parameter ← RandomView_u - U-Net_u; best ← ∅; i ← 0;
    loop
        i++;
        for each p_j ∈ parameter do
            score(p_j) ← usefulness(p_j, u, U-Net_u(array[1..n]));
        end for;
        best ← argmax over p_j ∈ parameter of score(p_j);
    until i = n or score(best) > score(U-Net_u[i]);
    if score(best) > score(U-Net_u[i]) then
        after ← U-Net_u[i..n];
        U-Net_u[i] ← best; i++;
        parameter ← parameter - best;
        parameter ← after ∪ parameter;
        U-Net_u ← U-Net_u - after;
        while i < n and parameter ≠ ∅ do
            for each p_j ∈ parameter do
                score(p_j) ← usefulness(p_j, u, U-Net_u(array[1..i-1]));
            end for;
            best ← argmax over p_j ∈ parameter of score(p_j);
            U-Net_u[i] ← best;
            parameter ← parameter - best;
            i++;
        end while;
    end if;

The first part finds the most useful result set in the random view, and the position i where it should be inserted in the U-Net.

The second part copies and deletes the remaining result sets (from position i to n) from the U-Net to the parameter list, because their scores need to be recomputed with respect to the best result set in the parameters. In the last part, the algorithm iteratively computes the scores for each empty position i in the U-Net. Then, the most useful parameter is moved to the U-Net at that position.

Evaluation And Results

This system has been tested on a Real Estate dataset from the US E-Governance portal. We provide an interface in which a user, while registering, supplies location, salary range and occupation-related details; the real estate database holds property type, cost, area, address, etc. Our dataset has been categorized based on Area, Location and Cost.

Table 1. The dataset summary is as below:

Category         Segregation   No. Of
Area             3             1500
Location         5             2000
Cost             3             1350
Property Type    2             3000

The search query consists of keywords specified in the above-mentioned category parameters. First, all users gossip for 400 rounds until convergence. Then, every 20 gossip rounds, all users submit one of their queries. The experiment stops at 500 gossip rounds. We measure the average recall, where recall is the fraction of items that have been successfully recommended. The relevance rank has been computed as the percentage of accurate recall results and is measured on a scale of 1 to 5. The charts below give a comparative analysis of our hybrid approach and other recommendation systems. We have also considered the results of the SPP and gossip-based approaches individually and found their results to be comparable with respect to accuracy and time.

Figure 2. Performance Measure against different diversification percentage. [Chart: y-axis Time(ms); x-axis Diversification %; series MMR, WUME, SPP, Gossip, Hybrid]
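The recall measure used in the evaluation can be computed per query and then averaged over all submitted queries. The sketch below is illustrative only: the item ids are made up, and returning 0.0 for a query with no relevant items is an assumption, since the text does not say how that case is handled.

```python
def recall(recommended, relevant):
    """Fraction of the relevant items that appear among the recommended ones."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # assumed convention for queries without relevant items
    return len(relevant & set(recommended)) / len(relevant)


def average_recall(runs):
    """runs: list of (recommended, relevant) pairs, one pair per submitted query."""
    if not runs:
        return 0.0
    return sum(recall(rec, rel) for rec, rel in runs) / len(runs)


# Hypothetical results for two queries.
runs = [
    (["p1", "p2", "p9"], ["p1", "p2", "p3", "p4"]),  # 2 of 4 relevant items found
    (["p5"], ["p5"]),                                # 1 of 1 relevant items found
]
print(average_recall(runs))
```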

Figure 3. Recall Result Accuracy Measure against different diversification percentage. [Chart: y-axis Relevance Rank; x-axis Diversification %; series MMR, WUME, SPP, Gossip, Hybrid]

Snapshots

Figure 4. Login Page

Figure 5. Home Page

Figure 6. Search Page

Figure 7. Result Page

Conclusion

The proposed system uses a hybrid approach to obtain better results when searching a diversified dataset. It builds on gossip-based search and recommendation, combining relevance and diversity to generate the usefulness score. This score is then used by the Space Partitioning and Probing algorithm to enhance the performance of the search operation. Our approach was evaluated on the Real Estate dataset from the US E-Governance portal. The results were observed to be twice as accurate, with a performance enhancement of 40 percent. This work can be further enhanced and implemented for multiple sites.

References

Hasan, M., Mueen, A., & Tsotras, V. Distributed Diversification of Large Datasets.

Servajean, M., Pacitti, E., Liroz-Gistau, M., Amer-Yahia, S., & El Abbadi, A. (2014). Exploiting Diversification in Gossip-Based Recommendation. In Data Management in Cloud, Grid and P2P Systems (pp. 25-36). Springer International Publishing.

Fraternali, P., Martinenghi, D., & Tagliasacchi, M. (2012, May). Top-k bounded diversification. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (pp. 421-432). ACM.

Patil, M., & Rao, G. R. (2014). Integrity verification in multi-cloud storage using cooperative provable data possession. International Journal of Computer Science & Information Technology, 5(2), 982-985.

Carretero, J., Isaila, F., Kermarrec, A. M., Taïani, F., & Tirado, J. M. (2012, June). Geology: Modular georecommendation in gossip-based social networks. In Distributed Computing Systems (ICDCS), 2012 IEEE 32nd International Conference on (pp. 637-646). IEEE.

Patil, M. M., & Rao, G. R. Security in Multi-Cloud Storage with Cooperative Provable Data Possession.

Magureanu, S., Dokoohaki, N., Mokarizadeh, S., & Matskin, M. (2012). Design and Analysis of a Gossip-based Decentralized Trust Recommender System. In 4th ACM Recsys Workshop on Recommender Systems and the Social Web.

Demidova, E., Fankhauser, P., Zhou, X., & Nejdl, W. (2010, July). DivQ: diversification for keyword search over structured databases. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval (pp. 331-338). ACM.