Hybrid Approach For Efficient Diversification on Cloud Stored Large Dataset

Geetanjali Mohite 1, Prof. Gauri Rao 2
1 Student, Department of Computer Engineering, B.V.D.U.C.O.E, Pune, Maharashtra, India
2 Associate Professor, Department of Computer Engineering, B.V.D.U.C.O.E, Pune, Maharashtra, India

Abstract
With an enormous amount of data being stored and searched on the web, quick and efficient search of relevant data has become a challenge. Diversification has recently been proposed as an approach that lets users search large data sets without having to look through all relevant results. Among the different diversification approaches, Gossip Based Recommendation has become popular because it provides scalability, automaticity, dynamicity and decentralized control. Here each user has the flexibility to construct a cluster of relevant datasets, which is then used in the processing of queries. Because of this user involvement, the chance of redundancy among users increases considerably. In this paper we introduce a usefulness score that indicates the relevance of the data. This usefulness score is then merged with the distance based access method of Space Partitioning and Probing to further minimize the number of accessed objects. The objective is to improve performance by accessing only a small number of objects while guaranteeing result quality in terms of relevance and diversification.

Key Words: Diversification, Ranking, Usefulness Score.

Introduction
Keyword based queries are predominantly used to retrieve data because of their ease of use. While keyword queries empower ordinary users to search vast amounts of data, their ambiguity makes them difficult to answer effectively, especially when they are short and vague. To address this problem, diversification was introduced for querying large data sets, so that the number of objects accessed while retrieving the relevant data can be minimized.
Researchers have proposed many approaches for diversification, of which the Gossip Based approach and Space Partitioning and Probing (SPP) are considered the better ones. Each has its own benefits: the Gossip Based approach is more focused on relevant data, whereas SPP is more focused on optimized speed. In this paper we propose a hybrid approach that uses the usefulness score as the ranking system and merges it with the Space Partitioning and Probing mechanism to reduce the number of accessed objects, which further reduces the computational overhead, thus moving towards more relevant and optimized search results.

To demonstrate our approach we have used a real estate dataset, because location mapped data is nowadays searched more and more on the web. Many location based services have been launched in which users can upload or create content mapped to places. Real estate directories and websites are a good example, since all data provided or searched by the user is location specific. This motivated us to use a Real Estate dataset to demonstrate diversification in the proposed Hybrid Approach.

Existing System
In today's world users generate a huge amount of diverse data. When a user wants to share data with particular user groups, the data has to be shared on many different applications, and a user who wants to search for the data has to search many different systems, because most systems do not communicate with each other. Many systems and approaches have been proposed to address this issue, which is a major hindrance to users adopting a distributed environment for their data sharing, mainly because of the lack of trust in data security and the absence of any assurance from service providers that users looking for data will always get the relevant information. Facebook is one example that tried to mitigate this issue by allowing accounts and data from different systems to be grouped (e.g. Facebook makes it possible to regroup DropBox and blogs into a single Facebook account). However, such solutions are limited to a few well known systems. Another existing approach in the context of large scale distribution of users and data is data sharing by distributed search and recommendation.

http://www.ijmr.net.in email id- irjmss@gmail.com Page 12

Proposed System
[Flowchart, steps as extracted: Start; Search parameter definition; User search criteria (input); usefulness score > 0.5; Exclude data from search dataset; Retrieve results; Space partitioning for cluster formation; Compare user search criteria and cluster; Display results; Stop]

In the existing Web world there are numerous systems in which users from diverse locations and with diverse interests can share data, and users have become heavily dependent on the Web for searching relevant information. The introduction of the cloud has added more distribution and diversification to the datasets that search engines have to navigate to extract relevant data for a user's search query. Researchers have developed many methodologies to handle this diversification issue, yield more accurate information and enhance user satisfaction. Our observation, however, is that the greater the effort to enhance accuracy, the greater the complexity and computational overhead, and hence the less efficient the system. This gap motivated us to design a system that enhances efficiency without compromising the accuracy of search results over a distributed, diversified dataset, thus enhancing user satisfaction. In this paper we propose a hybrid search and recommendation technique to devise a system that is both efficient and accurate.

In the proposed search and recommendation approach, when a user U submits a query Q, the system sends Q to a subset of users that forms a cluster, for example users belonging to specific locations. The subset holds usefulness scores against some predetermined parameters, for example property location and cost. The engine therefore has fewer data objects to search, because the users are clustered and only the data objects belonging to that cluster are searched. The usefulness scores provided by users who have searched for a similar query Q against the parameters define the relevance of the search result. This approach ensures accurate data with efficiency, and the relevance score of the system increases incrementally with system usage.

System Architecture
Figure 1: Architecture
[Diagram labels as extracted: Users; Search dataset; usefulness score; Search Engine with SPP and Diversification parameter; Dataset; Diversification parameter; Assign U-set; Search string; User; Search results]

Contribution
To demonstrate the efficiency of our approach we consider a Real Estate dataset in which the diversification parameters are Cost, Area, Location, Purpose, Population and Property type. The successful implementation of our approach relies on the steps below.

Let Q be the set of all possible queries (all combinations of terms), and consider the probability that a user v can return at least one relevant item for a random query q drawn from Q. Based on this probability, we first define the coverage with respect to the User Set. Then, based on coverage, we express the usefulness of a user v with respect to the other users in the user set.

For the gossip based recommendation approach we need a set of registered users, the U-Set. The user profiles should be such that the coverage probability is maximized.
Thus a strategy for maximizing the coverage probability will be devised.

For the usefulness score: given a user u from the U-Set, the usefulness of a user profile v is the probability that v can return relevant items for a random query q that could not be returned by the other users in u's U-Net. The usefulness score should also consider relevance. The usefulness score will be provided by the user.
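The coverage and usefulness definitions above can be illustrated with a small empirical estimate. The sketch below is not the authors' implementation: it assumes, hypothetically, that a user profile is a set of item ids, that each sampled query is represented by the set of item ids relevant to it, and that probabilities are approximated by frequencies over the sample. The names `coverage`, `usefulness`, `Profile` and `Query` are introduced here for illustration only, and the relevance weighting mentioned in the text is not modelled.

```python
from typing import List, Set

Profile = Set[str]  # item ids held by one user (hypothetical representation)
Query = Set[str]    # item ids relevant to one sampled query (hypothetical)


def coverage(profile: Profile, queries: List[Query]) -> float:
    """Fraction of sampled queries for which the profile holds
    at least one relevant item."""
    if not queries:
        return 0.0
    hits = sum(1 for relevant in queries if profile & relevant)
    return hits / len(queries)


def usefulness(candidate: Profile, u_net: List[Profile],
               queries: List[Query]) -> float:
    """Fraction of sampled queries for which the candidate can return a
    relevant item that no profile already in the U-Net can return."""
    if not queries:
        return 0.0
    # Everything the current U-Net members could already return.
    already_covered: Set[str] = set().union(*u_net)
    hits = sum(
        1 for relevant in queries
        if (candidate & relevant) - already_covered
    )
    return hits / len(queries)


if __name__ == "__main__":
    sample_queries = [{"a", "b"}, {"c"}, {"d"}, {"e"}]
    v = {"a", "c", "e"}
    u_net_of_u = [{"a"}, {"d"}]
    print(coverage(v, sample_queries))                # 0.75
    print(usefulness(v, u_net_of_u, sample_queries))  # 0.5
```

In this toy sample, v covers three of the four queries, but one of them is already answered by a U-Net member holding item "a", so only two queries count towards usefulness.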
IJITE Vol.04 Issue-04, (April, 2016) ISSN: 2321-1776, Impact Factor- 5.343

The useful U-Set is then clustered using the Space Partitioning and Probing mechanism, for which bounded diversification with sorted access methods was introduced for the first time and formally defined. The Pull/Bound Maximum Marginal Relevance (PBMMR) family of algorithms is used, which exploits spatial probing locations and the adaptive alternation of usefulness score-based and distance-based access to reduce the number of fetched objects. An instance of PBMMR, called Space Partitioning and Probing (SPP), is presented, whose pulling strategy uses a tight upper bound. SPP is shown to attain the same diversification quality and exactly the same output as MMR, the most popular result diversification algorithm, while accessing only a fraction of the objects.

Algorithm
To achieve the desired result we have devised a hybrid algorithm based on the Gossip based recommendation and Space Partitioning and Probing (SPP) algorithms. The Gossip based recommendation algorithm primarily focuses on the relevance of the data, whereas SPP focuses on the efficiency of the process.
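As background for the SPP component, the MMR baseline whose output SPP is stated to reproduce can be sketched as a greedy selection. This is a minimal illustration under stated assumptions, not SPP itself: it scans every object exhaustively, which is exactly the cost SPP is meant to avoid. The objective used here, a weighted sum of the usefulness score and the Euclidean distance to the nearest already selected object, is one common way to write MMR and is an assumption, as are the names `mmr_select` and `lam` and the two-dimensional coordinates.

```python
import math
from typing import List, Tuple

# (usefulness score, (x, y) location); a hypothetical object layout.
ScoredObject = Tuple[float, Tuple[float, float]]


def mmr_select(objects: List[ScoredObject], k: int,
               lam: float = 0.5) -> List[int]:
    """Greedily pick k object indices, trading off score against distance
    to the objects already selected (lam = 0 means score only)."""
    remaining = list(range(len(objects)))
    selected: List[int] = []

    def marginal(i: int) -> float:
        score, pos = objects[i]
        # Diversity term: distance to the nearest selected object,
        # taken as 0 while nothing has been selected yet.
        nearest = min(
            (math.dist(pos, objects[j][1]) for j in selected),
            default=0.0,
        )
        return (1.0 - lam) * score + lam * nearest

    while remaining and len(selected) < k:
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    listings = [
        (0.90, (0.0, 0.0)),
        (0.85, (0.1, 0.0)),  # high score but next to the first listing
        (0.50, (5.0, 5.0)),  # lower score but far away
        (0.40, (0.0, 0.1)),
    ]
    print(mmr_select(listings, k=2, lam=0.5))  # [0, 2]
    print(mmr_select(listings, k=2, lam=0.0))  # [0, 1]
```

With lam = 0.5 the second pick is the distant listing rather than the near duplicate, which is the diversification effect; with lam = 0 the selection degenerates to a plain ranking by score.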
The hybrid algorithm is defined below.

Algorithm 1: Useful U-Net
Input: u profile, U-Net_u (array[1..n]), RandomView_u
Output: U-Net_u is updated with respect to the Random View
parameter: unsorted list of user search criteria

    parameter ← RandomView_u − U-Net_u; best ← ∅; i ← 0;
    loop
        i++;
        for each p_j ∈ parameter do
            score(p_j) ← usefulness(p_j, u, U-Net_u(array[1..n]));
        best ← argmax over p_j ∈ parameter of score(p_j);
    until i = n or score(best) > score(U-Net_u[i]);
    if score(best) > score(U-Net_u[i]) then
        after ← U-Net_u[i..n];
        U-Net_u[i] ← best; i++;
        parameter ← parameter − best;
        parameter ← after ∪ parameter;
        U-Net_u ← U-Net_u − after;
        while i < N and parameter ≠ ∅ do
            for each p_j ∈ parameter loop
                score(p_j) ← usefulness(p_j, u, U-Net_u(array[1..i−1]));
            end loop;
            best ← argmax over p_j ∈ parameter of score(p_j);
            U-Net_u[i] ← best;
            parameter ← parameter − best;
            i++;
        end loop;

The first part finds the best useful resultset from the random view, and the position i where it should be inserted in the U-Net.
The second part copies the remaining resultset (from position i to N) from the U-Net to the parameter list and deletes it from the U-Net, because these scores need to be recomputed with respect to the best resultset in the parameter list. In the last part, the algorithm iteratively computes the scores for each empty position i in the U-Net and then moves the most useful parameter to the U-Net at that position.

Evaluation And Results
This system has been tested on a Real Estate Dataset from the US E-Governance portal. We have provided an interface in which a user, while registering, provides location, salary range and occupation related details; the real estate database holds property type, cost, area, address, etc. Our dataset has been categorized based on Area, Location and Cost.

Table 1. Dataset summary

Category         Segregation    No. Of
Area             3              1500
Location         5              2000
Cost             3              1350
Property Type    2              3000

The search query consists of keywords specified for the category parameters above. First, all users gossip for 400 rounds until convergence. Then, every 20 gossip rounds, all users submit one of their queries. The experiment stops at 500 gossip rounds. We measure the average recall, where recall is the fraction of items that have been successfully recommended. The relevance rank is computed as the percentage of accurate recall results and measured on a scale of 1 to 5. The charts below give a comparative analysis of our hybrid approach and other recommendation systems. We also evaluated the SPP and Gossip based approaches individually and found their results comparable in terms of the accuracy and time trade-off between the two.

Figure 2. Performance Measure against different diversification percentage
[Chart: x-axis Diversification % (10 to 50); y-axis Time(ms) (0 to 6); series MMR, WUME, SPP, Gossip, Hybrid]
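The recall measure used in the evaluation can be sketched as follows. This is a minimal illustration, assuming that for each submitted query both the recommended items and the full set of relevant items are known; the function names `recall` and `average_recall` are hypothetical, and the mapping of recall onto the 1 to 5 relevance rank is not specified in the text and is therefore not modelled.

```python
from typing import Iterable, List, Set, Tuple


def recall(recommended: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of the relevant items that were actually recommended."""
    if not relevant:
        return 0.0
    return len(set(recommended) & relevant) / len(relevant)


def average_recall(results: List[Tuple[List[str], Set[str]]]) -> float:
    """Mean recall over (recommended, relevant) pairs, one pair per query."""
    if not results:
        return 0.0
    return sum(recall(rec, rel) for rec, rel in results) / len(results)


if __name__ == "__main__":
    per_query = [
        (["a", "b"], {"a", "b", "c", "d"}),  # 2 of 4 relevant items found
        (["x"], {"x"}),                      # 1 of 1 relevant items found
    ]
    print(average_recall(per_query))  # 0.75
```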
Figure 3. Recall Result Accuracy Measure against different diversification percentage
[Chart: x-axis Diversification % (10 to 50); y-axis Relevance Rank (0 to 0.8); series MMR, WUME, SPP, Gossip, Hybrid]

Snapshots
Figure 4. Login Page
Figure 5. Home Page
Figure 6. Search Page
Figure 7. Result Page

Conclusion
The proposed system uses a hybrid approach to obtain better results when searching a diversified dataset. It builds on gossip based search and recommendation, combining relevance and diversity to generate the usefulness score. This score is then used by the Space Partitioning and Probing algorithm to enhance the performance of the search operation. Our approach was evaluated against the Real Estate Dataset from the US E-Governance portal. The results were observed to be twice as accurate, with a performance improvement of 40 percent. This work can be further enhanced and implemented for multiple sites.