Data Preprocessing Based on Partially Supervised Learning Na Liu1,2, a, Guanglai Gao1,b, Guiping Liu2,c


6th International Conference on Information Engineering for Mechanics and Materials (ICIMM 2016)

Data Preprocessing Based on Partially Supervised Learning

Na Liu1,2,a, Guanglai Gao1,b, Guiping Liu2,c
1 College of Computer Science, Inner Mongolia University, Hohhot, China
2 Department of Science, Hetao University, Bayannur, China
a csna_liu@163.com, b csggl@imu.edu.cn, c csguiping_liu@163.com

Keywords: data preprocessing; Web log mining; rule; partially supervised learning

Abstract. Data preprocessing is the foundation for improving the quality of data mining and determines the effect of Web mining. Currently, data for mining is typically collected from the server, but data collected from the client is more accurate. To handle such data better, we propose a data preprocessing method based on partially supervised learning. The paper discusses in detail the data cleaning process based on partially supervised learning, conducts experiments to verify the validity of the method, and determines the optimal number of training documents.

Introduction

With the advent of the information age, the Internet has become an important way for people to obtain information. Companies perform e-commerce activities on the Internet and deploy and develop Internet marketing, which has become an important part of marketing business development. Therefore, Web mining, which aims to find users' access patterns, has become a hot topic for enterprises and organizations in the network environment. The goal of Web mining is to extract useful knowledge from hyperlink structure, page content and usage logs. According to the category of data used in the mining process, Web mining is divided into three types: Web structure mining, Web content mining and Web log mining [1]. The main purpose of Web log mining is to find the access patterns of Web page users. Mining is generally carried out in three steps: preprocessing, data mining and subsequent processing. Because extraction technology is immature, Web logs contain incomplete data, isolated points and noisy data, so they cannot be used directly for mining [2][3].
Preprocessing therefore determines the effect of Web mining and is the basis of the whole process. Improving the quality of preprocessing so that the data meets the specific requirements of mining can improve the efficiency of decision-making. This paper proposes a data preprocessing method based on partially supervised learning and discusses the data cleaning process in detail.

Related Works

In 1999, Pyle [4] first applied data preprocessing to Web log mining. In the same year, Cooley [5] pointed out that fixing data errors and handling missing data are key tasks of data preprocessing. Because users access multiple Web sites or use application servers, Tanasa [6] proposed in 2005 that the log files from multiple Web sites and application servers be merged. Reference [7] proposed a preprocessing algorithm based on collaborative filtering.

Currently, because most data for Web log mining is collected directly from the Web server's log files, it contains a lot of junk data, incomplete browsing records, inaccurate browsing times and other problems. Client data drawn from user behavior can provide comprehensive and accurate information on user browsing behavior. In earlier years, the SiteHelper team [8] used the client to extract user information. Because this violated user privacy, the system failed to enter full market operation. P3P (Platform for Privacy Preferences) technology provides a privacy protection strategy and makes client data acquisition feasible. In the Web log mining field, however, research on client data mining is still relatively rare, and there is much room for improvement in data preprocessing techniques.

2016. The authors - Published by Atlantis Press

Data preprocessing includes data cleaning, transformation, integration, reduction and so on. This paper uses partially supervised learning to clean up Web browsing log files and validates the method through experiments.

Data Cleaning Based on Partially Supervised Learning

For various reasons, collected data sets always contain dirty data, e.g. incomplete data, illegal values, null values, inconsistent values and isolated points. Data cleaning is the process of resolving these problems.

A. Principle of data cleanup

The core idea of data cleaning is to identify the characteristics of the data and, according to these features, to design and implement effective algorithms, rules and strategies that complete the cleaning. The main tasks of data cleaning include:
(1) Predict missing values and complement incomplete data.
(2) Identify and remove illegal and null values.
(3) Convert or delete inconsistent data.
(4) Apply suitable algorithms, rules and strategies to amend or delete abnormal data.

Based on the above analysis of data cleaning principles and tasks, the basic process of data cleaning includes analyzing the characteristics of the data, defining data cleaning rules, performing data cleaning, validating the data and reflowing the clean data. The basic process is as follows:

Data set -> Analyzing -> Defining -> Perform data cleaning tasks -> Test and verify data -> Clean data
Fig.1 Process of data cleaning

Data analysis examines the collected data and extracts its regularities; this discovery process finds abnormal data and defines preliminary cleanup rules and procedures. Defining data cleanup rules means classifying the data on the basis of further analysis and defining detailed cleanup rules for each category. Performing data cleaning means applying the defined cleanup rules, together with appropriate algorithms and strategies, to the data source.
Data validation means verifying the correctness of the cleanup rules and evaluating their efficiency; analysis, definition, implementation and verification are iterated until satisfactory data cleaning rules are found. Reflow means that once the data has been cleaned, the clean data replaces the source data.

B. Marking Positive Examples for Rule-Based Learning

Dirty data (negative examples) is inherently uncertain, which makes it difficult to identify and to define cleanup rules for. In our case, however, most complete data (positive examples) in the source data have significant features and conform to business rules. In this study, the original data samples are as shown in Fig.2.

Last<=>1890 L_Start<=>2012-05-08 22-36-46
T<=>100[=]P<=>explorer.exe[=]I<=>132[=]W<=>10096[=]V<=>6.00.2900.5512[=]N<=>Microsoft(R) Windows(R) Operating System[=]C<=>Microsoft Corporation
T<=>105[=]P<=>QQ.exe[=]I<=>788[=]W<=>102c2[=]V<=>1.75.2991.674[=]N<=>NULL[=]C<=>Tencent
T<=>186[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>NULL[=]A<=>2027c[=]B<=>30282[=]V<=>5.2.0.804
T<=>192[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>wwww[=]A<=>2027c[=]B<=>30282[=]V<=>5.2.0.804
T<=>194[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.cqhrss.gov.cn/u/cqhrss/[=]A<=>2027c[=]B<=>6043e[=]V<=>5.2.0.804
T<=>211[=]P<=>iexplore.exe[=]I<=>3056[=]U<=>www.tao[=]A<=>602c8[=]B<=>8068e[=]V<=>8.00.6001.18702
T<=>312[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.bbzkb.net[=]A<=>2027c[=]B<=>20336[=]V<=>5.2.0.804
T<=>292[=]P<=>iexplore.exe[=]I<=>680[=]U<=>http://www.bxwx.org/text/5/5189.html[=]A<=>11058a[=]B<=>105f8[=]V<=>8.00.6001.18702
T<=>328[=]P<=>iexplore.exe[=]I<=>3140[=]U<=>www.ba[=]A<=>1e03a2[=]B<=>60598[=]V<=>8.00.7600.16385
T<=>340[=]P<=>iexplore.exe[=]I<=>3268[=]U<=>http://www.bbc.co.uk/[=]A<=>20228[=]B<=>202ce[=]V<=>8.00.6001.18702
T<=>569[=]P<=>iexplore.exe[=]I<=>3268[=]U<=>http://cn.wsj.com/gb/index.asp[=]A<=>20228[=]B<=>102fc[=]V<=>8.00.6001.18702
T<=>1245[=]P<=>iexplore.exe[=]I<=>3268[=]U<=>http://10.5.5.108/_layouts/CopyUtil.aspx[=]A<=>20228[=]B<=>44070e[=]V<=>8.00.6001.18702
T<=>1379[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.bbzkb.net[=]A<=>2027c[=]B<=>20336[=]V<=>5.2.0.804
T<=>3065[=]P<=>iexplore.exe[=]I<=>2276[=]U<=>http://192.168.1.1/[=]A<=>10564[=]B<=>105a0[=]V<=>8.00.6001.18702
T<=>1504[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.baidu.com[=]A<=>302fe[=]B<=>NULL[=]V<=>5.2.0.804
T<=>1700[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.baidu.com[=]A<=>302fe[=]B<=>4030a[=]V<=>5.2.0.804
T<=>1862[=]P<=>QQ.exe[=]I<=>788[=]W<=>f0384[=]V<=>1.75.2991.674
Fig.2 Raw data

(1) PU learning

Partially supervised learning is divided into LU learning (learning from Labeled and Unlabeled examples) and PU learning (learning from Positive and Unlabeled examples). PU learning aims to separate data into positive and negative examples even though no labeled negative examples are available for learning. In this study, we define a set of data that conforms to business norms (P) and an unlabeled data set (U) that contains all kinds of data. PU learning then builds a classifier with which the positive examples are marked out. Following [9], PU learning is divided into two steps in this article:
The first step: use the rules to extract positive examples and obtain P.
The second step: establish an SVM classifier and mark the positive examples.

(2) Extraction of rule-based positive examples

A Web address, or Uniform Resource Locator (URL), is the standard address of a resource on the Internet. Its basic structure is: scheme://domain:port/path?query_string#fragment_id. In the collected data, because of browser differences and the protection of personal privacy, Web addresses deviate from the standard or full format, as shown in Fig.2. Preliminary observation and analysis of the training data show that, in this case, the server name has the format [host name].[domain].[top-level domain]. A valid Web address is therefore defined as {%.%.top-level domain%}, where % represents any string (see the examples in Fig.2). To keep the program efficient, and because the analyzed logs come from Chinese Internet users, the top-level domains considered are the international top-level domains and China's national domain, .cn. These 20 rules are shown in Table 1.
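The record layout of Fig.2 and the rule form used for positive examples can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions (the experiments in this paper were run in Java): fields are assumed to be separated by [=] and keys from values by <=>, % is read as "any string, possibly empty" in the manner of SQL LIKE, and only five of the 20 top-level domains are listed, of which org is inferred from the positive examples rather than stated in the rule table. The names parse_record, rule_to_regex, matches_any_rule and is_positive are hypothetical helpers, not part of the original system.

```python
import re

# Assumption: a partial list of the 20 top-level domains used by the rules.
TOP_LEVEL_DOMAINS = ["com", "net", "cn", "org", "asia"]


def parse_record(line):
    """Split one log record into a dict of key -> value."""
    fields = {}
    for item in line.strip().split("[=]"):
        key, sep, value = item.partition("<=>")
        if sep:  # skip fragments that carry no key/value separator
            fields[key] = value
    return fields


def rule_to_regex(rule):
    """Turn a rule such as '%.%.com%' into a regex; '%' matches any string."""
    pieces = [re.escape(piece) for piece in rule.split("%")]
    return re.compile(".*".join(pieces), re.DOTALL)


# One compiled rule per top-level domain, e.g. '%.%.com%'.
RULES = [rule_to_regex("%.%." + tld + "%") for tld in TOP_LEVEL_DOMAINS]


def matches_any_rule(url):
    """True if the URL string satisfies at least one rule."""
    return any(rule.fullmatch(url) for rule in RULES)


def is_positive(record):
    """A record is a positive example if its U field matches a rule."""
    url = record.get("U")
    if url is None or url == "NULL":
        return False
    return matches_any_rule(url)


sample_lines = [
    "T<=>186[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>NULL[=]A<=>2027c[=]B<=>30282[=]V<=>5.2.0.804",
    "T<=>192[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>wwww[=]A<=>2027c[=]B<=>30282[=]V<=>5.2.0.804",
    "T<=>312[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.bbzkb.net[=]A<=>2027c[=]B<=>20336[=]V<=>5.2.0.804",
]
records = [parse_record(line) for line in sample_lines]
positives = [rec for rec in records if is_positive(rec)]
unlabeled = [rec for rec in records if not is_positive(rec)]
```

With these assumed rules, the record whose U field is www.bbzkb.net lands in positives, while the records whose U field is NULL or wwww stay in unlabeled. Addresses such as http://www.bbc.co.uk/ also stay unlabeled, because uk is not among the listed top-level domains.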

Table.1 The rule list
Rule No.    Formalized rules
1           %.%.com%
2           %.%.net%
3           %.%.cn%
...         ...
20          %.%.asia%

According to the above rules, the extracted sample set of positive examples is shown in Fig.3.

T<=>194[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.cqhrss.gov.cn/u/cqhrss/[=]A<=>2027c[=]B<=>6043e[=]V<=>5.2.0.804
T<=>312[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.bbzkb.net[=]A<=>2027c[=]B<=>20336[=]V<=>5.2.0.804
T<=>292[=]P<=>iexplore.exe[=]I<=>680[=]U<=>http://www.bxwx.org/text/5/5189.html[=]A<=>11058a[=]B<=>105f8[=]V<=>8.00.6001.18702
T<=>569[=]P<=>iexplore.exe[=]I<=>3268[=]U<=>http://cn.wsj.com/gb/index.asp[=]A<=>20228[=]B<=>102fc[=]V<=>8.00.6001.18702
T<=>1379[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.bbzkb.net[=]A<=>2027c[=]B<=>20336[=]V<=>5.2.0.804
T<=>1504[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.baidu.com[=]A<=>302fe[=]B<=>NULL[=]V<=>5.2.0.804
T<=>1700[=]P<=>360chrome.exe[=]I<=>3492[=]U<=>www.baidu.com[=]A<=>302fe[=]B<=>4030a[=]V<=>5.2.0.804
Fig.3 Positive examples

(3) Building an SVM classifier

Let the training set be
\[ \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}, \]
where $x_i$ is the input vector and $y_i$ is its class label. Assume that the first $t-1$ examples are positive examples P (labeled 1) and that the rest of the data set are unlabeled examples U. According to the related theory, we have the following formulation:

Minimize:
\[ \frac{1}{2} w^{T} w + C \sum_{i=t}^{n} \xi_i \tag{1} \]
Conditions:
\[
\begin{aligned}
w^{T} x_i + b &\ge 1, & i &= 1, 2, \ldots, t-1 \\
-1\,(w^{T} x_i + b) &\ge 1 - \xi_i, & i &= t, t+1, \ldots, n \\
\xi_i &\ge 0, & i &= t, t+1, \ldots, n
\end{aligned}
\tag{2}
\]

Here $w$ is the weight vector that determines the separating boundary; $\xi_i$ are the slack variables; $b \in R$ is the bias; and $C$ is the penalty coefficient applied when a point that belongs to one class deviates from it and falls on the other side of the border. The greater $C$ is, the less willing the classifier is to give up such a point, and the more the boundary shrinks. In PU learning, the positive examples are marked out and the remaining unlabeled data is defined as incomplete data.

C. Cleanup of incomplete data

Through analysis, incomplete data is divided into the following categories:
(1) Partially deleted data that can be completed automatically.
(2) Data that complies with business rules but has not been marked as a positive example, such as (www.bbc.co.uk, 10.5.5.108).
(3) Partially deleted data that needs human intervention to be completed.
(4) Null values and other error conditions not mentioned above.
The above data is cleaned in the order (1) - (4). The clean-up steps are as follows:

Step1: Compare the incomplete data with the data labeled as positive examples. If the key substrings match exactly, use the data in the positive examples to complement the incomplete data.
Step2: Analyze the characteristics of the data, redefine the filtering rules, and turn the legal data into complete data by manual intervention.
Step3: Treat the incomplete data of category (3) by human intervention.
Step4: Mark the data obtained from the previous three steps as labeled positive examples.
Step5: Remove all unlabeled data from the data set.

Experiment Results and Analysis

The data cleanup experiment based on partially supervised learning was conducted under the Windows 7 operating system, using the MyEclipse development tools and the Java language. The LibSVM package was used; it has relatively few parameters to tune, and in this study all experiments were performed with the default parameters. The training set has 1000 browse log files and the test set has 3000 browse log files. Each file holds between 0 and 2021 records and is about 1k-3000k in size. To reduce the impact of noise on the system, all log records were processed to remove stop words before the LibSVM experiments started. In this study, a total of 22 stop words were set, including www, http and .com.

To verify the effectiveness of the extraction rules and of partially supervised learning, this article evaluated the data cleaning effect experimentally. For validation, comparisons were conducted 10 times with randomly selected training sets of log files (the number of files is {100, 200, ..., 1000}) and two performance metrics (p, r), where p is precision and r is recall. Table 2 compares the performance for different training set sizes. The results show that as the number of training documents increases, adding further files has little effect on the results and may even make them worse. For this experiment, a number of training files between 500 and 700 is preferred.
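The two metrics can be made concrete with a small sketch. It assumes the standard definitions, which the paper does not spell out: precision is the fraction of records marked positive that are truly positive, and recall is the fraction of truly positive records that were marked. The function name precision_recall and the record identifiers in the example are hypothetical and are not taken from the experiment.

```python
def precision_recall(predicted_positive, actual_positive):
    """Return (precision, recall) for two collections of record identifiers."""
    predicted = set(predicted_positive)
    actual = set(actual_positive)
    true_positive = len(predicted & actual)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positive / len(predicted) if predicted else 0.0
    recall = true_positive / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: 4 records marked positive, 5 records truly positive,
# 3 records in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

In this made-up example p is 3/4 and r is 3/5. Marking fewer, safer records tends to raise p and lower r, which is the kind of trade-off that a comparison across training set sizes is meant to expose.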
Table.2 Performance comparison list
Training document number    p         r
100                         0.9579    0.7100
300                         0.9289    0.7854
500                         0.9253    0.8195
600                         0.9131    0.8344
700                         0.9023    0.8415
800                         0.8982    0.8242
1000                        0.8945    0.8373

Conclusions

The results of this experiment show that rule-based learning and partially supervised learning can both improve the efficiency of data preprocessing. This paper discussed in detail the data cleaning process based on partially supervised learning; the experiment verified the validity of the method and determined the optimal number of training documents. Although this study addressed problems related to data preprocessing for Web log mining, further research and improvement are still needed. First, the LibSVM parameters should be optimized. Second, data transformation and reduction methods need further research to complete the data preprocessing.

Acknowledgment

This work is supported by the Natural Science Foundation of China (No. 61563040), the Natural Science Foundation of Inner Mongolia, China (No. 2016ZD06), and the Natural Science Foundation in Universities of Inner Mongolia Autonomous Region, China (No. NJZY14334).

References

[1] WANG Shi, GAO Wen, LI Jin-Tao. Path clustering: discovering the knowledge in the web site. Journal of Computer Research and Development, vol.04, 2001, pp.482-486.
[2] Lenzerini, M. Data integration: A theoretical perspective[C]. In: Popa, L. (ed.). Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2002), 2002: 233-246.
[3] Ciszak, L. Applications of clustering and association methods in data cleaning[C]. In: Proceedings of the International Multiconference on Computer Science and Information Technology, 2008, 03: 97-103.
[4] Pyle, D. Data Preparation for Data Mining[M]. Morgan Kaufmann Publishers Inc., San Francisco, CA. 1999: 540.
[5] R. Cooley, B. Mobasher and J. Srivastava. Data preparation for mining World Wide Web browsing patterns[J]. Journal of Knowledge and Information Systems, 1999, 1(1): 5-32.
[6] Tanasa, D. and B. Trousse. Advanced data preprocessing for intersites web usage mining[J]. Intelligent Systems, IEEE, 2005, 19(2): 59-65.
[7] ING Chang-bin and Chen Li. Web Log Data Preprocessing Based On Collaborative Filtering[C]. In: Proceedings of the IEEE 2nd International Workshop On Education Technology and Computer Science, 2010: 118-121.
[8] DS Ngu, X. Wu. SiteHelper: A localized agent that helps incremental exploration of the World Wide Web[C]. In: Proceedings of the 6th International World Wide Web Conference, Santa Clara, 1997: 691-700.
[9] B. Liu, Y. Dai, X. Li, W. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples[C]. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM-2003), 2003: 19-22.