Data Preprocessing Based on Partially Supervised Learning
Na Liu1,2,a, Guanglai Gao1,b, Guiping Liu2,c

6th International Conference on Information Engineering for Mechanics and Materials (ICIMM 2016)

1 College of Computer Science, Inner Mongolia University, Hohhot, China
2 Department of Science, Hetao University, Bayannur, China
a csna_liu@163.com, b csggl@imu.edu.cn, c csguiping_liu@163.com

Keywords: data preprocessing; Web log mining; rule; partially supervised learning

Abstract. Data preprocessing is the foundation for improving the quality of data mining, and it determines the effect of Web mining. Currently, data for mining is typically collected from the server, but data collected from the client is more accurate. To handle such data better, we propose a data preprocessing method based on partially supervised learning. The paper discusses in detail the data cleaning process based on partially supervised learning, conducts experiments to verify the validity of the method, and determines the optimal number of training documents.

Introduction

With the advent of the information age, the Internet has become an important way for people to obtain information. Companies carry out e-commerce activities on the Internet and develop Internet marketing, which has become an important part of their business development. Therefore Web mining, which aims to discover users' access patterns, has become a hot topic for enterprises and organizations in the network environment. The goal of Web mining is to extract useful knowledge from hyperlink structure, page content and usage logs. According to the category of data used in the mining process, Web mining is divided into three types: Web structure mining, Web content mining and Web log mining [1]. The main purpose of Web log mining is to find users' Web page access patterns. Mining is generally carried out in three steps: preprocessing, data mining and subsequent processing. Because extraction technology is immature, Web logs contain incomplete data, isolated points and noisy data, and these logs cannot be used directly for mining [2][3].
Preprocessing therefore determines the effect of Web mining and is the basis of the whole process. Improving the quality of preprocessing, so that the data meets the specific requirements of mining, can improve the efficiency of decision-making. This paper proposes a data preprocessing method based on partially supervised learning and discusses the data cleaning process in detail.

Related Works

In 1999 Pyle [4] used data preprocessing in Web log mining for the first time. In the same year, Cooley [5] pointed out that fixing data errors and handling missing data are key tasks of data preprocessing. Because users access multiple Web sites or use application servers, Tanasa [6] proposed in 2005 that the log files from multiple Web sites and application servers be merged. A preprocessing algorithm based on collaborative filtering was also proposed in [7]. Currently, because most data for Web log mining is collected directly from the Web server's log files, it contains a lot of junk data and suffers from incomplete browsing records, inaccurate browsing times and other problems. Client-side data, sourced from user behavior, can provide comprehensive and accurate information on users' browsing behavior. In earlier years, the SiteHelper team [8] used the client to extract user information. Because it violated user privacy, the system never fully went into market operation. P3P (Platform for Privacy Preferences) technology provides a privacy protection strategy and makes client data acquisition feasible. But in the Web log mining field, research on client data mining is still relatively rare, and there is much room for improvement in data preprocessing techniques. Data preprocessing includes data cleaning, transformation, integration, reduction and so on. This paper uses partially supervised learning methods to clean Web browsing log files and validates the method through experiments.

2016. The authors - Published by Atlantis Press

Data Cleaning Based on Partially Supervised Learning

For various reasons, collected data sets always contain dirty data, e.g. incomplete data, illegal values, null values, inconsistent values and isolated points. Cleaning up dirty data means solving these problems in the data set.

A. Principle of data cleanup

The core idea of data cleaning is to find the characteristics of the data and, according to these features, to design and implement effective algorithms, rules and strategies that complete the cleaning. The main tasks of data cleaning are:

Predict missing values and complement incomplete data.
Identify and remove illegal and null values.
Convert or delete inconsistent data.
Apply suitable algorithms, rules and strategies to amend or delete abnormal data.

Based on the above analysis of data cleaning principles and tasks, the basic process of data cleaning consists of analyzing the characteristics of the data, defining data cleaning rules, performing data cleaning, validating the data and reflowing the clean data. The basic process is shown in Fig. 1.

Fig. 1 Process of data cleaning (flowchart stages: Data set, Analyzing, Defining, Perform data cleaning tasks, Test and verify data, Clean data)

Analyzing the data means examining the collected data and extracting its regularities. This discovery process finds abnormal data and defines the preliminary cleanup rules and procedures. Defining data cleanup rules means classifying the data on the basis of further analysis and defining detailed cleanup rules for the different categories. Performing data cleaning means applying the defined cleanup rules, with appropriate algorithms and strategies, to the data source.
Data validation means verifying the correctness of the cleanup rules and evaluating their efficiency, iterating through analysis, definition, implementation and verification until satisfactory data cleaning rules are found. Reflow means that, once the data has been cleaned, the clean data replaces the source data.

B. Marking Positive Examples for Rule-Based Learning

Dirty data (negative examples) is inherently uncertain, which makes it difficult to identify and to define cleanup rules for it. In our case, however, most complete data (positive examples) in the source data has significant features and conforms to business rules. In this study, the original data samples are as shown in Fig. 2.

Last<=>1890 L_Start<=>2012-05-08 22-36-46
T<=>100[=]P<=>explorer.exe[=]I<=>132[=]W<=>10096[=]V<=>6.00.2900.5512[=]N<=>Microsoft(R) Windows(R) Operating System[=]C<=>Microsoft Corporation
T<=>105[=]P<=>QQ.exe[=]I<=>788[=]W<=>102c2[=]V<=>1.75.2991.674[=]N<=>NULL[=]C<=>Tencent
T<=>186[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>NULL[=]A<=>2027c[=]B<=>30282[=]V<=>5.2.0.804
T<=>192[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>wwww[=]A<=>2027c[=]B<=>30282[=]V<=>5.2.0.804
T<=>194[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.cqhrss.gov.cn/u/cqhrss/[=]A<=>2027c[=]B<=>6043e[=]V<=>5.2.0.804
T<=>211[=]P<=>iexplore.exe[=]I<=>3056[=]U<=>www.tao[=]A<=>602c8[=]B<=>8068e[=]V<=>8.00.6001.18702
T<=>312[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.bbzkb.net[=]A<=>2027c[=]B<=>20336[=]V<=>5.2.0.804
T<=>292[=]P<=>iexplore.exe[=]I<=>680[=]U<=>http://www.bxwx.org/text/5/5189.html[=]A<=>11058a[=]B<=>105f8[=]V<=>8.00.6001.18702
T<=>328[=]P<=>iexplore.exe[=]I<=>3140[=]U<=>www.ba[=]A<=>1e03a2[=]B<=>60598[=]V<=>8.00.7600.16385
T<=>340[=]P<=>iexplore.exe[=]I<=>3268[=]U<=>http://www.bbc.co.uk/[=]A<=>20228[=]B<=>202ce[=]V<=>8.00.6001.18702
T<=>569[=]P<=>iexplore.exe[=]I<=>3268[=]U<=>http://cn.wsj.com/gb/index.asp[=]A<=>20228[=]B<=>102fc[=]V<=>8.00.6001.18702
T<=>1245[=]P<=>iexplore.exe[=]I<=>3268[=]U<=>http://10.5.5.108/_layouts/CopyUtil.aspx[=]A<=>20228[=]B<=>44070e[=]V<=>8.00.6001.18702
T<=>1379[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.bbzkb.net[=]A<=>2027c[=]B<=>20336[=]V<=>5.2.0.804
T<=>3065[=]P<=>iexplore.exe[=]I<=>2276[=]U<=>http://192.168.1.1/[=]A<=>10564[=]B<=>105a0[=]V<=>8.00.6001.18702
T<=>1504[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.baidu.com[=]A<=>302fe[=]B<=>NULL[=]V<=>5.2.0.804
T<=>1700[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.baidu.com[=]A<=>302fe[=]B<=>4030a[=]V<=>5.2.0.804
T<=>1862[=]P<=>QQ exe[=]i<=>788[=]w<=>f0384[=]v<=>1 75 2991 674
Fig. 2 Raw data

(1) PU learning

Partially supervised learning is divided into LU learning (learning from Labeled and Unlabeled examples) and PU learning (learning from Positive and Unlabeled examples). PU learning separates data into positive and negative examples, although no labeled negative examples are available for learning. In this study, we define a set of data that conforms to business norms (P) and an unlabeled data set (U) that contains all kinds of data. PU learning builds a classifier from these two sets and uses it to mark out the positive examples. Following [9], PU learning is divided into two steps in this article:

The first step: use the rules to extract positive examples and obtain P.
The second step: build an SVM classifier and mark the positive examples.

(2) Extraction of rule-based positive examples

A Web address, or Uniform Resource Locator (URL), is the standard address of a resource on the Internet. Its basic form is scheme://domain:port/path?query_string#fragment_id. In the collected data, because of different browsers and the protection of personal privacy, Web addresses differ from the standard or full format, as shown in Fig. 2. Preliminary observation and analysis of the training data show that, in this case, the server name has the format [host name].[domain].[top-level domain]. A valid Web address is therefore defined as {%.%.top-level domain%}, where % represents any string. To keep the program efficient, and because the logs analyzed come from Chinese Internet users, the top-level domains considered are the international top-level domains plus China's national domain, .cn. The resulting 20 rules are shown in Table 1.
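The rule format described above can be illustrated with a minimal sketch. It assumes that each record is a sequence of key<=>value pairs joined by [=], as in Fig. 2, and that % behaves like a wildcard matching any string, including the empty one. The names parse_record, rule_to_regex and is_positive are hypothetical, and the rule list below is only the subset that can be read off the paper (the .org rule is inferred from the .org address listed among the positive examples); the full set of 20 rules is not reproduced here.

```python
import re

FIELD_SEP = "[=]"   # separates the fields of one record (as in Fig. 2)
KV_SEP = "<=>"      # separates a field key from its value

# Assumed subset of the 20 rules; only .com, .net, .cn and .asia are listed
# explicitly, and .org is inferred from the positive examples.
RULES = ["%.%.com%", "%.%.net%", "%.%.cn%", "%.%.org%", "%.%.asia%"]


def parse_record(line):
    """Split one log record into a dict mapping field key to value."""
    fields = {}
    for part in line.strip().split(FIELD_SEP):
        key, sep, value = part.partition(KV_SEP)
        if sep:  # skip fragments that carry no key<=>value pair
            fields[key.strip()] = value.strip()
    return fields


def rule_to_regex(rule):
    """Translate a rule such as '%.%.com%' into a regex ('%' = any string)."""
    literal_parts = [re.escape(piece) for piece in rule.split("%")]
    return re.compile(".*".join(literal_parts))


COMPILED_RULES = [rule_to_regex(rule) for rule in RULES]


def is_positive(record):
    """Return True if the record's U (URL) field matches at least one rule."""
    url = record.get("U")
    if not url or url == "NULL":
        return False
    return any(rx.fullmatch(url) for rx in COMPILED_RULES)
```

Under these assumptions, a record whose U field is www.baidu.com is accepted by the .com rule, while www.tao, NULL and http://www.bbc.co.uk/ are left unlabeled, which agrees with the later remark that www.bbc.co.uk complies with business rules but is not marked as positive.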

Table 1 The rule list

Rule No.    Formalized rule
1           %.%.com%
2           %.%.net%
3           %.%.cn%
...         ...
20          %.%.asia%

According to the above rules, the extracted sample set of positive examples is shown in Fig. 3.

T<=>194[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.cqhrss.gov.cn/u/cqhrss/[=]A<=>2027c[=]B<=>6043e[=]V<=>5.2.0.804
T<=>312[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.bbzkb.net[=]A<=>2027c[=]B<=>20336[=]V<=>5.2.0.804
T<=>292[=]P<=>iexplore.exe[=]I<=>680[=]U<=>http://www.bxwx.org/text/5/5189.html[=]A<=>11058a[=]B<=>105f8[=]V<=>8.00.6001.18702
T<=>569[=]P<=>iexplore.exe[=]I<=>3268[=]U<=>http://cn.wsj.com/gb/index.asp[=]A<=>20228[=]B<=>102fc[=]V<=>8.00.6001.18702
T<=>1379[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.bbzkb.net[=]A<=>2027c[=]B<=>20336[=]V<=>5.2.0.804
T<=>1504[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.baidu.com[=]A<=>302fe[=]B<=>NULL[=]V<=>5.2.0.804
T<=>1700[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.baidu.com[=]A<=>302fe[=]B<=>4030a[=]V<=>5.2.0.804
Fig. 3 Positive examples

(3) Building an SVM classifier

Let the training set be

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$

where $x_i$ is the input vector and $y_i$ is its class label. Assume that the first $t-1$ examples are positive examples P (labeled 1) and that the rest of the data set are unlabeled examples U (treated as class $-1$ in the constraints below). According to the related theory, we have the following formulation:

Minimize:

$$\frac{1}{2} w^{T} w + C \sum_{i=t}^{n} \xi_i \tag{1}$$

Conditions:

$$\begin{aligned}
w^{T} x_i + b &\ge 1, & i &= 1, 2, \ldots, t-1 \\
-1\,(w^{T} x_i + b) &\ge 1 - \xi_i, & i &= t, t+1, \ldots, n \\
\xi_i &\ge 0, & i &= t, t+1, \ldots, n
\end{aligned} \tag{2}$$

Here $w$ is the weight vector (the greater its value, the more pronounced the border); $\xi_i$ are the slack variables; $b \in R$ is the bias; and $C$ is the penalty coefficient applied when a point belongs to a certain class but deviates from it and falls on the other side of the border. The greater $C$ is, the less willing the classifier is to give up such a point, and the more the boundary shrinks. In PU learning, the positive examples are marked out, and the remaining unlabeled data is defined as incomplete data.

C. Cleanup of incomplete data

Through analysis, incomplete data is divided into the following categories:
(1) Partially deleted data that can be completed automatically.
(2) Data that complies with business rules but has not been marked as positive examples, such as www.bbc.co.uk and 10.5.5.108.
(3) Partially deleted data that needs human intervention to be completed.
(4) Null values and other error conditions not mentioned above.

The data is cleaned in the order (1) to (4). The clean-up steps are as follows:

Step 1: Compare the incomplete data with the data labeled as positive examples. If the key substrings match exactly, use the data in the positive examples to complete the incomplete data.
Step 2: Analyze the characteristics of the data, redefine the filtering rules, and turn the legal data into complete data by manual intervention.
Step 3: Treat the incomplete data of category (3) by human intervention.
Step 4: Mark the data obtained from the three steps above as labeled positive examples.
Step 5: Remove all remaining unlabeled data from the data set.

Experiment Results and Analysis

The data cleanup experiment based on partially supervised learning was conducted under the Windows 7 operating system, with the MyEclipse development tools and in the Java language. The LibSVM package was used; it has relatively few parameters to tune, and in this study all experiments were performed with the default parameters. The training set has 1000 browse log files and the test set has 3000 browse log files. Each file has about 0-2021 records and a size of about 1k-3000k. To reduce the impact of noise on the system, stop words were removed from all log records before the LibSVM experiment. A total of 22 stop words were set in this study, including www, http and .com.

To verify the effectiveness of the extraction rules and of partially supervised learning, this article evaluated the data cleaning effect experimentally. For validation, comparisons were conducted 10 times with randomly selected training sets of log files (the number of files taken from {100, 200, ..., 1000}) and two performance metrics (p, r), where p is precision and r is recall. Table 2 compares the performance for different training set sizes. The results show that increasing the number of training files has only a small effect on the results, and the effect may even be irregular: precision falls from 0.9579 at 100 files to 0.8945 at 1000 files, while recall rises from 0.7100 to 0.8415 at 700 files and then fluctuates. For this experiment, a training set of between 500 and 700 files is preferred.
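The paper does not spell out how p and r are computed. The following is a minimal sketch that assumes the standard set-based definitions: precision is the share of records marked positive that are truly positive, and recall is the share of truly positive records that were marked. The function name precision_recall and the record identifiers are hypothetical and are not taken from the experiment.

```python
def precision_recall(predicted_positive, actual_positive):
    """Compute (p, r) from two sets of record identifiers.

    predicted_positive: records the cleaning procedure marked as positive.
    actual_positive:    records that are truly positive (ground truth).
    """
    true_positive = len(predicted_positive & actual_positive)
    # Guard against empty sets so the ratios are always defined.
    p = true_positive / len(predicted_positive) if predicted_positive else 0.0
    r = true_positive / len(actual_positive) if actual_positive else 0.0
    return p, r
```

For example, if four records are marked positive and three of them are among five truly positive records, the sketch yields p = 0.75 and r = 0.6.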
Table 2 Performance comparison list

Training document number    p         r
100                         0.9579    0.7100
300                         0.9289    0.7854
500                         0.9253    0.8195
600                         0.9131    0.8344
700                         0.9023    0.8415
800                         0.8982    0.8242
1000                        0.8945    0.8373

Conclusions

The results of this experiment show that rule-based learning and partially supervised learning can both improve the efficiency of data preprocessing. This paper discussed in detail the data cleaning process based on partially supervised learning, verified the validity of the method through experiments, and determined the optimal number of training documents. Although this study addresses problems related to data preprocessing for Web log mining, further research and improvement are still needed: first, optimization of the LibSVM parameters; second, further research on data transformation and reduction methods to complete the data preprocessing.

Acknowledgment

This work is supported by the Natural Science Foundation of China (No. 61563040), the Natural Science Foundation of Inner Mongolia, China (No. 2016ZD06), and the Natural Science Foundation in Universities of Inner Mongolia Autonomous Region, China (No. NJZY14334).

References

[1] WANG Shi, GAO Wen, LI Jin-Tao. Path clustering: discovering the knowledge in the web site. Journal of Computer Research and Development, vol. 04, 2001, pp. 482-486.
[2] Lenzerini, M. Data integration: A theoretical perspective [C]. In: Popa, L. (ed.), Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2002), 2002: 233-246.
[3] Ciszak, L. Applications of clustering and association methods in data cleaning [C]. In: Proceedings of the International Multiconference on Computer Science and Information Technology, 2008, 03: 97-103.
[4] Pyle, D. Data Preparation for Data Mining [M]. Morgan Kaufmann Publishers Inc., San Francisco, CA, 1999: 540.
[5] R. Cooley, B. Mobasher and J. Srivastava. Data preparation for mining World Wide Web browsing patterns [J]. Journal of Knowledge and Information Systems, 1999, 1(1): 5-32.
[6] Tanasa, D. and B. Trousse. Advanced data preprocessing for intersites web usage mining [J]. Intelligent Systems, IEEE, 2005, 19(2): 59-65.
[7] ING Chang-bin and Chen Li. Web Log Data Preprocessing Based on Collaborative Filtering [C]. In: Proceedings of the IEEE 2nd International Workshop on Education Technology and Computer Science, 2010: 118-121.
[8] D. S. Ngu, X. Wu. SiteHelper: A localized agent that helps incremental exploration of the World Wide Web [C]. In: Proceedings of the 6th International World Wide Web Conference, Santa Clara, 1997: 691-700.
[9] B. Liu, Y. Dai, X. Li, W. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples [C]. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM-2003), 2003: 19-22.