A Data Preprocessing Framework of Geoscience Data Sharing Portal for User Behavior Mining


Mo Wang 1,2, Juanle Wang 1,3,*

1 State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China
2 College of Resources and Environment, University of Chinese Academy of Sciences, Beijing, China
3 Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing, China
* Corresponding author, wangil@igsnrr.ac.cn

Abstract-Science data sharing has many advantages for both scientific research and education. Knowing the behaviors of science data sharing participants supports informed decision making on data sharing policy and data sharing website design. Nowadays, data sharing is carried out mainly through the Internet, and web usage mining provides an ideal approach to uncover the behaviors of data sharing users. This paper presents a data preprocessing framework for further user behavior mining of a geoscience data sharing portal (geodata.cn). The preprocessing steps include data cleaning, user identification, session identification, and data modeling. Web server logs served as the major data source of this study. Heuristic algorithms were employed to accomplish data cleaning and user identification. Different session identification methods were applied for comparison. Users' geolocations were identified using an online GeoIP lookup tool, which provides the geographical coordinates of an IP address. On the basis of these preprocessing procedures, a web usage data model of a science data sharing portal was proposed for further user behavior mining, such as user classification and spatial association rule mining.

Keywords-geoscience data sharing; web usage mining; spatial data mining; data preprocessing

I. INTRODUCTION

Data has been seen as a basic infrastructure of science.
Data sharing has many advantages that boost scientific research and education. Data sharing in the science community has a long history, usually in ad hoc ways [1]. However, the development of information technology and the Internet has brought new concepts and forms of data sharing. Data sharing is nowadays conducted mainly through the Internet; hence data sharing behaviors are becoming web usage behaviors on data sharing portals. Geodata.cn is a leading data sharing portal for earth system science in China. It has a large number of users and abundant data resources across Earth System Science disciplines [2]. Yet its user behaviors, which can be interpreted as data sharing behaviors from the end users' perspective, remain poorly understood. User behavior mining of a web site belongs to the field of web usage mining, which is a subfield of Web mining. Web mining is the field concerned with knowledge discovery from the Internet. More precisely, it is often divided into three topics: Web Content Mining, Web Structure Mining and Web Usage Mining [3]. The output of Web Usage Mining can be of great value in network structure optimization and website server configuration [4]. Moreover, extracted user behavior can be further used in recommendation systems and proactive services in the context of data sharing. This study aims to set up a data preprocessing framework for user behavior mining of a science data sharing portal in the context of web usage mining and spatial data mining.

II. DATA AND METHOD

A. Background

The subject of this study is the user behaviors of the National Data Sharing Platform of Earth System Science (Geodata.cn). It is one of the National Science & Technology Infrastructure platforms dedicated to science data sharing. The objective of the platform is to provide data support and services for research in Earth System Science and for pioneering innovation across relevant disciplines.
Geodata.cn has been operating for nearly 10 years and holds a representative position in the science data sharing domain in China and internationally. By the end of August 2014, the platform had 91,944 registered users and the portal had received 17 million visits in total [5]. This activity left abundant data recorded by the website servers, so the user behaviors of the platform can be mined and analyzed with data mining methods.

B. Data

Data for Web Usage Mining come mainly from Web servers; data from proxy servers and Web clients can also be used if available. This study used web server log data, since the other two sources were not available. A web server log covering two months (July and August 2014) was acquired for this study. The web server log was stored in Common Log Format. Fig. 1 is an example of a log entry, from which the user's IP, visiting time, method, URL visited, status, referrer, and client details can be acquired. The log file of this study contains 1.69 million log entries in the form of that example. Tab. I lists the information retrieved from a log entry.

Supported by National Technology Infrastructure - Data Sharing Platform of Earth System Science, Special Informational Infrastructure Program of CAS (xxhi2504-i-oi)
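As an illustration of how the fields listed for a log entry (IP, time, method, URL, protocol, status, file size, referrer, client) might be extracted, the following is a minimal sketch rather than the authors' script. It assumes the layout usually called the combined log format, which matches the fields named in the text; the pattern name `LOG_PATTERN`, the function `parse_entry`, and the sample line (including its IP address and URL) are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "method url protocol" status size "referrer" "client"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<client>[^"]*)"'
)


def parse_entry(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent; treat it as zero bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


# Made-up sample line for illustration only (not taken from the study's log).
sample = (
    '192.0.2.1 - - [05/Aug/2014:10:26:00 +0800] '
    '"GET /portal/index.jsp HTTP/1.1" 200 15072 "-" '
    '"Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0)"'
)
entry = parse_entry(sample)
```

Lines that do not match the assumed layout return `None`, so a caller can count or inspect them instead of silently dropping them.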

[Figure 1. A log entry example]

TABLE I. COMPONENTS OF A LOG ENTRY
Field              Value
IP
Time               05/Aug/2014:10:26:
Method             GET
URL                /extra/res/libs/kendo/extensions/kendo.extension.ui.js
Protocol           HTTP/1.1
Status             200
File size (byte)   15072
Referrer
Client             Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/ Firefox/31.0

C. Method

Data fusion and data cleaning

For large-scale websites, user information may come from multiple Web servers or application servers. Data fusion is the process of merging logs from these servers. Data cleaning aims to eliminate records that are irrelevant or redundant for the analysis, e.g. requests for graphical page content (.jpg, .png, .gif, etc.), style (.css) files, audio files, etc. [6]. In addition, requests from web crawlers (or robots) and error requests should also be removed from the original log. Requests for graphical content and style files, as well as error requests, are easy to eliminate because they can be identified from the URL request field and the status field. However, robots and Web crawlers are sometimes hard to identify if they use a fake user agent. In this case, heuristics based on the navigation behavior of robots are often used to detect them, for instance in the work of Tan and Kumar [7]. The algorithm employed for data cleaning is described as:

Input: RawLog // source web server log file
Output: LogBase // cleaned log database
Begin
  For each LogRecord in Read(RawLog):
    If not (LogRecord.Request.url.endsWith(.gif, .jpeg, .jpg, .css, .js)
            or LogRecord.Status > 299 or LogRecord.Status < 200
            or LogRecord.Agent.contains(crawler, spider, robot))
    Then write(LogBase, LogRecord);
End

User Identification

User identification is the process of distinguishing different users. Without an authentication mechanism, the best source for identifying users is cookies. However, cookies are often disabled by users, and some websites do not use cookies. Another useful piece of information is the user's IP address. Yet the IP address alone is not sufficient to map log entries to unique users, owing to the proliferation of ISP proxy servers, which assign rotating IP addresses to clients as they browse the Web [3]. If cookies are not available, the agent and referrer fields in log entries can provide auxiliary information to identify users. A heuristic method was devised to achieve user identification:

Step 1, assume a new IP address represents a new user.
Step 2, for multiple log entries that share the same IP, if their Internet browser or operating system differs, they are assumed to come from different users.
Step 3, for the users identified by the above two steps, if a URL requested by a user cannot be reached through any hyperlink on the pages that user has visited, a new user is assumed.

Once individual users are identified, their geographical location can be determined from the IP address. A GeoIP lookup service provided by ipinfo.io [9] was used to acquire the geolocation of users.

Session identification

Session identification is the process of dividing each user's page access activities into sessions [10]. Each session represents a visit to the website. Websites without an authorization mechanism or an embedded session ID system have to rely on heuristic methods to complete session identification [3]. The simplest, but often useful, method is a time window: if the time between two page requests exceeds a certain limit, it is assumed that a new session begins.
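The data cleaning rule and the first two user identification steps described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: entries are assumed to be dicts with `url`, `status`, `ip` and `client` keys; the names `IGNORED_SUFFIXES`, `ROBOT_MARKERS`, `is_relevant` and `assign_user_ids` are hypothetical; the whole client string stands in for "browser or operating system"; and Step 3 (checking hyperlink reachability) is omitted because it requires the site's link structure.

```python
# Suffixes named in the text and the cleaning algorithm (assumed union of both lists).
IGNORED_SUFFIXES = (".gif", ".jpeg", ".jpg", ".png", ".css", ".js")
# Agent substrings taken from the cleaning algorithm's robot check.
ROBOT_MARKERS = ("crawler", "spider", "robot")


def is_relevant(entry):
    """Keep an entry only if it is a successful, non-robot page request."""
    path = entry["url"].split("?", 1)[0].lower()  # ignore the query string
    if path.endswith(IGNORED_SUFFIXES):
        return False
    if not 200 <= entry["status"] <= 299:
        return False
    agent = entry["client"].lower()
    return not any(marker in agent for marker in ROBOT_MARKERS)


def assign_user_ids(entries):
    """Steps 1 and 2: one user per distinct (IP, client) pair.

    Returns a list of integer user IDs aligned with `entries`.
    """
    known = {}
    ids = []
    for entry in entries:
        key = (entry["ip"], entry["client"])
        ids.append(known.setdefault(key, len(known)))
    return ids
```

Matching on the URL suffix rather than on any substring is a deliberate choice in this sketch: a substring test for "js" would also discard dynamic pages whose names end in ".jsp".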
A previous study of heuristic algorithms for session identification by Berendt and Mobasher [11] compared three heuristic methods under frame-based and frame-free site structures. Results showed that the referrer-based heuristic algorithm (Href) outperformed the other two in the frame-free case. In view of that result, the Href algorithm was adopted in this study. The core of the algorithm is as follows: suppose p and q are two consecutive page requests, and p belongs to session S. Let $t_p$ and $t_q$ denote the timestamps of p and q, respectively. Then q is assigned to session S if the referrer of q was previously invoked within S, or if the referrer is undefined ("-" in the log) and $t_q - t_p \le \Delta$ for a specified delay $\Delta$. Otherwise, a new session containing q is constructed. For comparison, the Href algorithm and a time window based method were both tested. The time window based heuristic method (Htwin) [12] is described as follows:

Step 1, if a new user emerges, generate a new session.
Step 2, within the sessions identified by step 1, if the referrer of a log entry is "-", it is assumed that a new session starts.
Step 3, within the sessions identified by step 1 and step 2, if the time interval between a log entry and the previous one exceeds a threshold (30 min), a new session starts.
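The two session heuristics described above can be compared on a small example. The sketch below is a simplified reading of both rules, applied to the time-ordered requests of a single, already identified user. The `Request` class, the function names, and the default `delta` of 10 seconds are assumptions made for illustration (the text does not state the delay used), and referrers are assumed to be normalized to the same form as request URLs.

```python
from dataclasses import dataclass

UNDEFINED = "-"  # how an undefined referrer appears in the log


@dataclass
class Request:
    time: float     # seconds since an arbitrary reference point
    url: str
    referrer: str


def sessions_htwin(requests, max_gap=30 * 60):
    """Time window heuristic: split on undefined referrer or on a long gap."""
    sessions = []
    for req in requests:
        starts_new = (
            not sessions
            or req.referrer == UNDEFINED
            or req.time - sessions[-1][-1].time > max_gap
        )
        if starts_new:
            sessions.append([req])
        else:
            sessions[-1].append(req)
    return sessions


def sessions_href(requests, delta=10.0):
    """Referrer heuristic: q joins the current session S if its referrer was
    requested in S, or if the referrer is undefined and the gap is <= delta."""
    sessions = []
    for req in requests:
        if sessions:
            current = sessions[-1]
            seen = {r.url for r in current}
            gap = req.time - current[-1].time
            if req.referrer in seen or (req.referrer == UNDEFINED and gap <= delta):
                current.append(req)
                continue
        sessions.append([req])
    return sessions
```

With these rules, every request with an undefined referrer opens a new session under the time window method, whereas the referrer method keeps it in the current session when it follows the previous request closely.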


Data formatting and data modeling

In this study, the cleaned log entries were written into a MySQL database, together with the user ID, user geolocation and session ID. Once these data are stored in the database, a final data formatting step is required, tailored to the specific data mining task [13]. For example, temporal information is not necessary for user cluster mining and association rule mining, so for those tasks the data formatting step does not read the timestamp from the log entries. The final step of data preparation for data mining is to construct a proper mathematical data model. A geo-referenced data model based on the traditional user-pageview matrix data model [8] is proposed in the next section.

III. RESULTS AND DISCUSSION

A. Results

Data cleaning

The raw log file has 1,694,561 entries, of which 451,544 remain after data cleaning, roughly 1/4 of the whole. It can be concluded that most of the raw log data is redundant for user behavior mining. The data cleaning script was written in Python. Each raw log entry was read by the script, filtered by checking the status, client and URL, and finally exported to a MySQL database. A screenshot of the resulting log table in the database is shown in Fig. 2.

[Figure 2. A screenshot of log entry table in the database]

User Identification

The GeoIP lookup service provides a JSON API, which can easily be called from a script and returns geolocation information in JSON. Fig. 3 illustrates a JSON response to a GeoIP request. Because network access providers allocate IP addresses dynamically, the returned longitude and latitude cannot precisely identify a user's location, only the region in which the user is located, at sub-city scale. From the 451,544 retained log entries, 14,549 users were identified, of which 13,786 were geolocated with the GeoIP lookup API. Fig. 5 is a map that depicts the worldwide user distribution. Users come mainly from three regions, i.e. China, the United States and Europe, and most of them are from China. Within China, users are mainly from major cities, for instance Beijing, Shanghai, and Wuhan. This is in accordance with the distribution of universities and research institutes.

[Figure 3. JSON response of a GeoIP request from ipinfo.io]

Session Identification

With regard to session identification, the results of the Htwin method and the Href method differed significantly. The Htwin method identified 115,517 sessions, while the Href method identified 56,211, only about half as many. This can be explained by the large proportion of undefined referrers ("-") in the log, which leads the Htwin method to overestimate the number of sessions. Identified users and sessions are stored in a table together with the log entry ID, as shown in Fig. 4. Overall, Tab. II shows the results of the preprocessing steps.

[Figure 4. A screenshot of session table in the database]

TABLE II. RESULTS OF PREPROCESSING

[Figure 5. User distribution of July and August, 2014]

Data formatting and data modeling

As the last step, a refined data model is proposed as follows. Given a set of n pageviews after data cleaning, $P = \{p_1, p_2, \ldots, p_n\}$, and a set of m transactions, $T = \{t_1, t_2, \ldots, t_m\}$, where each $t_i$ in T is a subset of P, each transaction t can be denoted as an l-length sequence of ordered pairs:

$t = \langle (p_1^t, w(p_1^t)), (p_2^t, w(p_2^t)), \ldots, (p_l^t, w(p_l^t)) \rangle$ (1)

where each $p_i^t = p_j$ for some j in {1, 2, ..., n}, and $w(p_i^t)$ is the weight associated with pageview $p_i^t$ in transaction t, representing its significance. $w(p_i^t)$ can be a binary 1 or 0, representing the existence or non-existence of a pageview, or can be the time spent on the page, depending on the data mining task. Given the transaction t above, a transaction vector tv can be defined as:

$tv = \langle w_1, w_2, \ldots, w_n \rangle$ (2)

where $w_i = w(p_j^t)$ if pageview $p_i$ appears in transaction t as $p_j^t$ for some j in {1, 2, ..., l}, and $w_i = 0$ otherwise. Thus, the set of all user transactions can be modeled as an m x n user-pageview matrix, an example of which is shown in Fig. 6.

[Figure 6. An example of a user-pageview (transaction) matrix [8], with sessions/users user0 to user9 as rows and pageviews A to F as columns. Here the weight for each pageview is the amount of time (e.g., in seconds) that a particular user spent on the pageview.]

By adding the users' geolocation, this model can be refined into a three-dimensional model in which each transaction is viewed as t = (x, y, tv), where x and y represent its geographic coordinates (longitude and latitude), and tv denotes its transaction vector in the form of formula (2). Fig. 7 illustrates the data model structure. With such a data model, spatial analysis of user behavior, for example user similarity analysis considering both transaction vector distance and geographical distance, should be feasible.
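To make the georeferenced model t = (x, y, tv) concrete, the sketch below builds a transaction vector from pageview weights and combines a vector distance with a geographical distance. The text only states that both distances could be considered; the specific choices here (cosine distance, great-circle distance via the haversine formula, the weighting parameter `alpha`, and the normalization constant `scale_km`) are assumptions made for illustration, and all function names are hypothetical.

```python
import math


def transaction_vector(transaction, pageviews):
    """Map {pageview: weight} onto the ordered pageview list P; absent pages get 0."""
    return [transaction.get(p, 0) for p in pageviews]


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def combined_distance(t1, t2, alpha=0.5, scale_km=1000.0):
    """Weighted sum of behavioral and (capped, normalized) geographical distance.

    Each transaction is a tuple (x, y, tv) with x = longitude, y = latitude.
    """
    x1, y1, tv1 = t1
    x2, y2, tv2 = t2
    geo = min(haversine_km(x1, y1, x2, y2) / scale_km, 1.0)
    return alpha * cosine_distance(tv1, tv2) + (1 - alpha) * geo
```

For example, `transaction_vector({"A": 10, "C": 5}, ["A", "B", "C"])` gives `[10, 0, 5]`, one row of a user-pageview matrix in which the weights are time spent on each page.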


[Figure 7. An example of a georeferenced user transaction data model; the blue line represents the transaction vector (tv) of a user located at 30°E, 45°N.]

B. Discussion

This study implemented data preprocessing procedures for user behavior mining of a geoscience data sharing portal. The aim of the study is to set up a data preprocessing framework and to yield ready-to-use data for further data mining tasks, e.g. user classification and association rules of users' data interests. Three preprocessing steps were conducted using heuristic methods, and users' geolocations were identified according to their IP addresses. These procedures are indispensable for mining user behaviors and their spatial attributes. Of the two session identification methods, Href is deemed the more plausible after careful examination of the log entries. The final results of these procedures are written into a database along with the log entry identifier. Depending on the specific data mining task, cleaned log entries, user, session, and geolocation information can be read from the database and, with proper data formatting and data modeling, used to carry out that task. Future work will focus on data mining based on users' interest in geoscience data by parsing URL requests within each session.

ACKNOWLEDGMENT

The authors would like to express their appreciation for data support from the Data Sharing Platform of Earth System Science, National Science & Technology Infrastructure of China.

REFERENCES

[1] C. Tenopir, S. Allard and K. Douglass, "Data sharing by scientists: practices and perceptions," PLoS ONE, 2011, 6(6).
[2] Y. Zhu, J. Sun and S. Liao, "Earth system scientific data sharing research and practice," Geo-information Science, 2010, 12(1): 1-8.
[3] R. Kosala and H. Blockeel, "Web mining research: a survey," Sigkdd Explorations, 2000, 2(1).
[4] J. Srivastava, R. Cooley and M. Deshpande, "Web usage mining: discovery and applications of usage patterns from web data," Sigkdd Explorations, 2000, 1(2).
[5] Data Sharing Platform of Earth System Science, "Operating report 2014."
[6] R. Cooley, B. Mobasher and J. Srivastava, "Data preparation for mining world wide web browsing patterns," Knowledge and Information Systems, 1999, 1(1).
[7] P. N. Tan and V. Kumar, "Discovery of web robot sessions based on their navigational patterns," Data Mining and Knowledge Discovery, 2002, 6(1).
[8] B. Liu, Web Data Mining, 2nd ed. Berlin: Springer, 2011.
[9] ipinfo.io.
[10] P. Zhu and M.-s. Zhao, "Session identification algorithm for web log mining," International Conference on Management and Service Science, 2010.
[11] B. Berendt, B. Mobasher and M. Nakagawa, "The impact of site structure and user environment on session reconstruction in web usage analysis," Knowledge Discovery and Data Mining, 2002.
[12] L. C. Feng, "Study on crucial techniques of web usage mining," Wuhan: Huazhong Univ. of Science and Technology, 2007, Chinese.
[13] D. Tanasa and B. Trousse, "Advanced data preprocessing for intersites web usage mining," IEEE Expert / IEEE Intelligent Systems, (2).
