DIVERSIFIED DATASET EXPLORATION BASED ON USEFULNESS SCORE

Geetanjali Mohite 1, Prof. Gauri Rao 2
1 Student, Department of Computer Engineering, B.V.D.U.C.O.E, Pune, Maharashtra, India
2 Associate Professor, Department of Computer Engineering, B.V.D.U.C.O.E, Pune, Maharashtra, India

ABSTRACT: Data sharing is a key driver of the growth in web usage today. Users have become large producers of diverse data, which is stored in data spaces distributed across different systems. Sharing such diversified data, and then searching it across many systems, has therefore become difficult. Many approaches have been proposed for this setting of widely distributed users and systems. In this paper we discuss the gossip-based recommendation approach for finding data that is useful to the user. We then suggest an enhancement to the approach that aims at more relevant search results, with the efficiency of the approach measured against the usefulness of the results.

Keywords: Usefulness Score, Space Partitioning and Probing, Diversification, Distributed Environment

[1] INTRODUCTION

Data diversification has recently attracted substantial attention because it increases user confidence in Recommender Systems (RS), which improves user satisfaction, and because of its role in online and backend search. The huge amount of information available on the web creates the need for methods that select specific subsets of it and present them to the user. Data diversification takes different forms, which consist of selecting items so that their novelty, coverage, or content dissimilarity is maximized [6].

Existing approaches to data diversification fall into two categories:
1. Greedy heuristics
2. Interchange heuristics

Greedy heuristics (e.g., [11]) build a diverse dataset incrementally, considering one item at a time so that some distance function is maximized, whereas interchange heuristics (e.g., [10]) start from a random initial dataset and try to improve it. Applying indexing to diversification has also been proposed by many researchers. One such approach is a Dewey-based tree for structure-based diversity, which uses attribute priorities; the second approach is
spatial indexing, which exploits the locations of those nearest neighbors of an item that are farthest from each other. In spite of the immense interest in diversification in recent years, most previous research addresses the static version of the problem, in which the available items from which a diverse subset is selected do not change over time.

Distributed search and recommendation offers a simple solution to data sharing under these challenges. Specifically, in gossip-based search and recommendation every user constructs a cluster of "relevant" data that is used when processing queries. However, considering only relevance introduces a significant amount of data duplication among users. When a query is submitted, the user profiles in each user's cluster are quite similar, so the probability of retrieving the same set of relevant items increases and recall is limited. This paper therefore discusses a modified version of gossip-based recommendation in which a "usefulness" score is introduced to improve the relevance of the retrieved data, used in conjunction with the Space Partitioning and Probing algorithm for performance efficiency.

[2] EXISTING SYSTEM OVERVIEW

Recommender Systems (RS) guide users in quickly browsing and exploring a large product space, helping them identify interesting products efficiently. However, common RS usually do not provide diverse results, even though diversity is considered a required feature. The study of diversity-aware RS has become an important research area in recent years, inspired by diversification solutions for Information Retrieval (IR).
Diversity is a concept that has been applied in many fields, mostly with the goal of obtaining a set of objects that are highly dissimilar to one another and that, as a group, maximize a quality criterion. However, there is usually a trade-off between diversity and quality. Hence, the diversification problem is how to choose k elements from a set so that diversity is maximized at a low sacrifice in quality.

Diversification approaches for both RS and IR can be classified as implicit or explicit. In IR, implicit approaches assume that selecting dissimilar documents will indirectly cover the diverse aspects of a query. MMR (Carbonell et al. 1998) is a classic example: it aims to maximize relevant novelty, a weighted linear combination of relevance and novelty, where novelty is defined as dissimilarity from previously selected documents. In contrast, explicit approaches directly attempt to cover different query aspects or sub-topics. IA-Select (Agrawal et al. 2009) and xQuAD (Santos et al. 2010) are examples of explicit approaches. In addition, Zheng et al. (2012) proposed strategies to specify coverage functions of query sub-topics that serve as a basis for their diversification solution.

[3] COMPARATIVE ANALYSIS

In this paper the different approaches are compared against the following parameters.
(a) Greedy optimization: whether the approach builds the result set incrementally, choosing the best available item at each step.
(b) Explicit approach: whether the proposed solution directly attempts to cover the diverse aspects of the query or user profile.
(c) Implicit approach: whether the proposed solution prevents redundancy within the results.
(d) Control of diversity vs. relevance trade-off: whether there is a control parameter that can tune the diversity vs. relevance trade-off.
(e) Encourages discovery: whether the proposed approach avoids penalizing novel or serendipitous items.
(f) Control of exploitation vs. exploration trade-off: whether there is a control parameter that can tune the exploitation vs. exploration trade-off.

The table below compares the previously proposed approaches (Y = yes, N = no, - = not stated):

Parameter                                          1  2  3  4  5  6  7  8
Greedy Optimization                                Y  Y  Y  Y  Y  Y  Y  N
Explicit Approach                                  N  Y  Y  Y  N  N  Y  N
Implicit Approach                                  Y  N  N  N  Y  Y  N  N
Control of diversity vs. relevance trade-off       Y  N  Y  Y  Y  Y  -  -
Encourages Discovery                               -  N  N  N  -  -  N  -
Control of exploitation vs. exploration trade-off  N  N  N  N  N  N  N  N

1 - (Carbonell et al. 1998)
2 - (Agrawal et al. 2009)
3 - (Santos et al. 2010)
4 - (Zheng et al. 2012)
5 - (Smyth et al. 2001)
6 - (Ziegler et al. 2005)
7 - (Vargas 2012)
8 - (Adomavicius et al. 2009)

Table 1 Comparison of different approaches

[4] OUR APPROACH - USEFULNESS SCORE BASED EXPLORATION OF DIVERSIFIED DATASET

On today's Web there are numerous systems in which users from diverse locations and with diverse interests can share data, and users have become heavily dependent on the Web when searching for relevant information. The introduction of the cloud has added further distribution and diversification to the datasets that search engines have to navigate in order to extract data relevant to a user's search query. To discuss the methodology of our modified approach, we consider a real estate dataset in which the diversification parameters are Cost, Area, Location and Property type.
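To make the four diversification parameters named above concrete, the following is a minimal sketch of a listing record and a distance function over cost, area, location and property type. The class name `Listing`, the function `listing_distance`, the normalization ranges and the equal weighting of the four attributes are all illustrative assumptions; the paper does not specify a distance measure.

```python
from dataclasses import dataclass
from math import hypot
from typing import Tuple


@dataclass(frozen=True)
class Listing:
    """Hypothetical real estate record with the four diversification parameters."""
    cost: float                    # asking price, in any consistent currency unit
    area: float                    # floor area, in any consistent area unit
    location: Tuple[float, float]  # (x, y) map coordinates
    property_type: str             # e.g. "flat" or "villa"


def listing_distance(a: Listing, b: Listing,
                     cost_range: float, area_range: float,
                     loc_range: float) -> float:
    """Return a dissimilarity in [0, 1]: the mean of four per-attribute distances.

    The *_range arguments are assumed positive; they scale each numeric
    attribute so that no single attribute dominates the result.
    """
    d_cost = min(abs(a.cost - b.cost) / cost_range, 1.0)
    d_area = min(abs(a.area - b.area) / area_range, 1.0)
    d_loc = min(hypot(a.location[0] - b.location[0],
                      a.location[1] - b.location[1]) / loc_range, 1.0)
    # Property type is categorical: identical types are at distance 0, others at 1.
    d_type = 0.0 if a.property_type == b.property_type else 1.0
    return (d_cost + d_area + d_loc + d_type) / 4.0
```

Equal weights are only one possible choice; an application could weight the attributes according to user priorities.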
Let Q be the set of all possible queries (all combinations of terms), and consider the probability that a user v can return at least one relevant item for a random query q drawn from Q. In the following, we first define coverage with respect to the user set (U-Set). Then, based on coverage, we express the usefulness of a user v with respect to the other users in the user set.
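One possible empirical reading of these two notions is sketched below. It estimates the probabilities as fractions over a finite sample of queries, which is an assumption made here for illustration. The function names, the representation of a profile as a set of items and the `is_relevant` predicate are likewise hypothetical.

```python
from typing import Callable, Hashable, Iterable, Sequence, Set

Item = Hashable
Query = Hashable
Relevance = Callable[[Item, Query], bool]


def relevant_items(profile: Set[Item], query: Query,
                   is_relevant: Relevance) -> Set[Item]:
    """Items of one user profile that are relevant to the query."""
    return {item for item in profile if is_relevant(item, query)}


def coverage(profile: Set[Item], queries: Sequence[Query],
             is_relevant: Relevance) -> float:
    """Fraction of sampled queries for which the profile returns at least one relevant item."""
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if relevant_items(profile, q, is_relevant))
    return hits / len(queries)


def usefulness(profile: Set[Item], others: Iterable[Set[Item]],
               queries: Sequence[Query], is_relevant: Relevance) -> float:
    """Fraction of sampled queries for which the profile returns a relevant
    item that none of the other profiles returns."""
    if not queries:
        return 0.0
    others = list(others)
    hits = 0
    for q in queries:
        already_returned: Set[Item] = set()
        for other in others:
            already_returned |= relevant_items(other, q, is_relevant)
        if relevant_items(profile, q, is_relevant) - already_returned:
            hits += 1
    return hits / len(queries)


# Toy example: an item is relevant when the query term occurs in its description.
term_match = lambda item, q: q in item.split()
v = {"flat pune", "villa goa"}
rest = [{"flat pune"}]
sample = ["pune", "goa", "delhi"]
print(coverage(v, sample, term_match))          # 2 of 3 queries are covered
print(usefulness(v, rest, sample, term_match))  # only "goa" adds something new
```

In this toy example user v covers two of the three queries but is useful for only one of them, because the "pune" item is already held by another user.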
For the gossip-based recommendation approach we need a set of registered users, say the U-Set. The user profiles should be such that the coverage probability is maximized, so a strategy for maximizing coverage probability will be devised. The usefulness score is defined as follows: given a user u's U-Set, the usefulness of a user profile v is the probability that it can return relevant items for a random query q that could not be returned by the other users in u's U-Set. The usefulness score should also take relevance into account. The usefulness score will be provided by the user.

Useful U-Set clustering is then performed using the Space Partitioning and Probing mechanism, a setting in which bounded diversification with sorted access methods is introduced and formally defined. The Pull/Bound Maximum Marginal Relevance (PBMMR) family of algorithms will be used; it exploits spatial probing locations and adaptively alternates between usefulness-score-based and distance-based access to reduce the number of fetched objects. An instance of PBMMR, called Space Partitioning and Probing (SPP), uses a pulling strategy based on a tight upper bound. SPP is reported to attain the same diversification quality and exactly the same output as MMR, the most popular result diversification algorithm, while accessing only a fraction of the objects.

[Figure 1: System architecture. Components: Data Owner, Diversified Dataset, Web Server, Search Engine, Search User. Flows: the data owner submits data to be made available for search; the search user's query is formed and pushed to the search engine; search results are provided to the search user; the modified usefulness score is sent back.]

The architecture used to demonstrate our system is distributed and web based: the diversified dataset resides on a centralized web server, and the web application is also hosted on the cloud web server. Users can access the hosted site from any cloud-enabled environment.
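For reference, the MMR baseline whose output SPP is described above as reproducing can be sketched as follows. This is plain greedy MMR, not SPP or PBMMR: it scores every remaining candidate in each round, which is exactly the access cost SPP aims to reduce. Using the usefulness score as the relevance term, the parameter name `lam` and the example data are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def mmr_select(candidates: Sequence[T], k: int,
               relevance: Callable[[T], float],
               distance: Callable[[T, T], float],
               lam: float = 0.5) -> List[T]:
    """Greedy MMR: repeatedly pick the item maximizing
    lam * relevance + (1 - lam) * novelty, where novelty is the
    distance to the closest item selected so far."""
    remaining = list(candidates)
    selected: List[T] = []
    while remaining and len(selected) < k:
        def score(item: T) -> float:
            # With nothing selected yet, novelty is the same constant for
            # every item, so the first pick is simply the most relevant one.
            novelty = min((distance(item, s) for s in selected), default=0.0)
            return lam * relevance(item) + (1.0 - lam) * novelty
        best = max(range(len(remaining)), key=lambda i: score(remaining[i]))
        selected.append(remaining.pop(best))
    return selected


# Hypothetical listings: (name, usefulness score in [0, 1], position on a line).
listings = [("A", 0.90, 0.00), ("B", 0.85, 0.05), ("C", 0.60, 1.00)]
picked = mmr_select(listings, k=2,
                    relevance=lambda x: x[1],
                    distance=lambda a, b: abs(a[2] - b[2]),
                    lam=0.5)
print([name for name, _, _ in picked])  # ['A', 'C']
```

With `lam=0.5` the second pick is C rather than the slightly more useful B, because B lies almost on top of A. Setting `lam=1.0` disables diversification and returns A and B.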
The user submits a search query, which the query engine interprets. Based on the logged-in user's group and on parameters such as income and area, the usefulness score is derived and search results are provided to the user. For the search results,
the user can specify a usefulness score; the score is then recomputed and persisted in the database for the next search query by other users in the same group.

REFERENCES

[1] Adomavicius, G., & Kwon, Y., Toward more diverse recommendations: Item re-ranking methods for recommender systems. In Workshop on Information Technologies and Systems. (2009, December).
[2] Agrawal, R., Gollapudi, S., Halverson, A., & Ieong, S., Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining (pp. 5-14). ACM. (2009, February).
[3] Carbonell, J., & Goldstein, J., The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval (pp. 335-336). ACM. (1998, August).
[4] Drosou, M., & Pitoura, E., Search result diversification. ACM SIGMOD Record, 39(1), 41-47. (2010).
[5] Haritsa, J. R., The KNDN Problem: A Quest for Unity in Diversity. IEEE Data Eng. Bull., 32(4), 15-22. (2009).
[6] Santos, R. L., Macdonald, C., & Ounis, I., Exploiting query reformulations for web search result diversification. In Proceedings of the 19th international conference on World wide web (pp. 881-890). ACM. (2010, April).
[7] Smyth, B., & McClave, P., Similarity vs. diversity. In Case-Based Reasoning Research and Development (pp. 347-361). Springer Berlin Heidelberg. (2001).
[8] Vargas, S., Novelty and Diversity Enhancement and Evaluation in Recommender Systems. MSc. diss., Department of Ingeniería Informática, Universidad Autónoma de Madrid, Spain. (2012).
[9] Vee, E., Srivastava, U., Shanmugasundaram, J., Bhat, P., & Yahia, S. A., Efficient computation of diverse query results. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on (pp. 228-236). IEEE. (2008, April).
[10] Yu, C., Lakshmanan, L., & Amer-Yahia, S., It takes variety to make a world: diversification in recommender systems. In Proceedings of the 12th international conference on extending database technology: Advances in database technology (pp. 368-378). ACM. (2009, March).
[11] Ziegler, C. N., McNee, S. M., Konstan, J. A., & Lausen, G., Improving recommendation lists through topic diversification. In Proceedings of the 14th international conference on World Wide Web (pp. 22-32). ACM. (2005, May).
[12] Zheng, W., Wang, X., Fang, H., & Cheng, H., Coverage-based search result diversification. Information Retrieval, 15(5), 433-457. (2012).