A MINING TECHNIQUE FOR WEB DATA USING CLUSTERING


Ms. Chhaya M. Meshram 1, Prof. Rahila Sheikh 2
1 B.D.C.O.E. Sevagram, 2 R.G.C.E.R.T. Chandrapur

Abstract- Web text mining is an important branch of data mining. Text mining is the process of searching large volumes of documents for certain keywords or key phrases. Web mining is an extension of text mining: it is an exciting new field that integrates data and text mining within a website. It enhances the website with intelligent behavior, such as suggesting related links or recommending new products to the consumer. Classification and clustering are data mining activities that extract meaningful new information from the data. Clustering enables one to discover hidden similarity and key concepts. Any clustering technique relies on concepts such as a data representation model, a similarity measure, a cluster model, and a clustering algorithm. Classification is a form of data analysis that can be used to gather and describe important data sets; it is used to estimate the categorical label of a data object. The objective of this paper is to provide a new web text mining model that includes a query directed web page clustering algorithm and the vector space model.

Keywords- Clustering, Text Mining, VSM, Web Text Mining

1. INTRODUCTION

The World Wide Web is rapidly emerging as an important medium for the dissemination of information on a wide range of topics. This increases the need for techniques that unveil the inherent structure in the underlying data. Clustering is one of these. Clustering enables one to discover hidden similarity and key concepts. Any clustering technique relies on concepts such as a data representation model, a similarity measure, a cluster model, and a clustering algorithm.
Web search is difficult because it is hard for users to construct queries that are both sufficiently descriptive and sufficiently discriminating to find just the web pages that are relevant to the user's search goal. Queries are often ambiguous: words and phrases are frequently polysemous, and user search goals are often narrower in scope than the queries used to express them. This ambiguity leads to search result sets containing distinct page groups that meet different user search goals. Users must often refine their search by modifying the query to filter out the irrelevant results. To refine queries effectively, users must understand the result set, but this is time consuming if the result set is unorganized. Web page clustering is one approach for helping users both to comprehend the result set and to refine the query. Web page clustering algorithms identify semantically meaningful groups of web pages and present these to the user as clusters. The clusters provide an overview of the contents of the result set, and when a cluster is selected the result set is refined to just the relevant pages in that cluster. Clustering performance is very important for usability. If cluster quality is poor, the clusters will be semantically meaningless or will contain many irrelevant pages. If cluster coverage is poor, clusters representing useful groups of pages will be missing, or the clusters will be missing many relevant pages. This paper uses a query directed web page clustering algorithm (QDC) that gives better clustering performance than other clustering algorithms.
The algorithm has five key innovations:
- a new query directed cluster quality guide that uses the relationship between clusters and the query;
- an improved cluster merging method that generates semantically coherent clusters by using cluster description similarity in addition to cluster overlap;
- a new cluster splitting method that fixes the cluster chaining (drifting) problem;
- an improved heuristic for cluster selection that uses the query directed cluster quality guide;
- a new method of improving clusters by ranking the pages by relevance to the cluster.

The objective of this paper is to provide a new web text mining model. Section 2 describes the new web text mining model. Section 3 describes the new technique for web page clustering and the Vector Space Model used to show the similarity between query and document. Experiments, interpretation and discussion are presented in Section 4. Section 5 provides the conclusion.

2. WEB TEXT MINING MODEL

The proposed web text mining model consists of several phases. First, the query is given to a search engine, which returns 100 URLs related to that query. The summaries of the extracted URLs are then passed through various preprocessing phases. Next, clusters are formed using the query directed web page clustering algorithm. Finally, the vector space model is used to show the similarity between the query and each document.

ISSN: Page 240
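The vector space model step of Section 2 can be illustrated with a minimal Python sketch. The paper does not state which term weighting or similarity function it uses, so raw term counts and cosine similarity are assumptions here, and the function names `term_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a bag-of-words vector of raw term counts.

    Raw counts are an assumption; the paper does not specify a weighting.
    """
    return Counter(text.lower().split())


def cosine_similarity(query, document):
    """Cosine of the angle between the query and document term vectors."""
    q, d = term_vector(query), term_vector(document)
    dot = sum(q[term] * d[term] for term in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty text is treated as having no similarity
    return dot / (norm_q * norm_d)


# Example: a three-word query against two short page summaries.
query = "sachin tendulkar cricket"
print(cosine_similarity(query, "sachin tendulkar cricket records"))
print(cosine_similarity(query, "football match report"))
```

A document that shares no term with the query scores 0, and a document with the same term distribution as the query scores 1, so the value can be used to rank the documents of a cluster against the query.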

Fig:- Web text mining model (blocks: Input Query (Phrase); Download Web Documents; Lower Case Conversion; Symbol Filter, with Symbol List; Tag Filter; Stop Words Filter, with Stop Words List; Stemmer; WordNet; Base Clustering; Merging; Splitting; Selection; Cleaning; VSM algorithm for finding similarity between query & document; List of clusters along with Documents)

3. WEB PAGE CLUSTERING

The proposed web text mining system is implemented using the query directed web page clustering algorithm. This algorithm gives better clustering performance than other clustering algorithms. Originally the algorithm takes a single word as the query, but the proposed system also accepts multiple-word queries. The algorithm has five key innovations, described below.

3.1 Base Clustering

A base cluster is described by a single word and consists of all the pages containing that word. Equivalently, base clusters are single-word search refinements based on the current search results. After standard page preprocessing, the algorithm constructs a collection of base clusters, one for every word that occurs in at least 4% of the pages. Using a lower threshold increases clustering performance at the cost of algorithm speed. The algorithm then computes the query distance of each base cluster, that is, its distance from the query, using the Normalized Google Distance (NGD):

$$\mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}}$$

where f(x) and f(y) are the number of hits for words x and y, respectively, f(x, y) is the number of hits for pages containing both x and y, and M is the total number of web pages that Google indexes.

3.2 Merging

The algorithm constructs larger clusters by merging clusters together. Each cluster c is constructed from a set of base clusters, base(c), and is described by the word that describes its largest base cluster. However, the set of pages in a cluster is not necessarily all the pages in its base clusters. A page is only included in the cluster if it is present in enough of the cluster's base clusters. This threshold should increase with the number of base clusters in the cluster, but should not increase steeply, so the algorithm uses a log function: a cluster is the set of pages that are in at least $\log_2(|base(c)| + 1)$ of the cluster's base clusters.

Initially there is a singleton cluster for each base cluster. The algorithm merges clusters using single-link clustering over a relatedness graph. The relatedness graph has the clusters as vertices and has an edge between any two clusters that are sufficiently similar. Single-link clustering merges together all clusters that are part of the same connected component of the graph.

3.3 Splitting

Each cluster now contains at least all the base clusters that relate to one idea; this is assured because single-link clustering merges all related clusters. But single-link clustering, even with the improved similarity function, can produce clusters containing multiple ideas and irrelevant base clusters due to cluster chaining (drifting). Such clusters need to be split. Interestingly, it is easier to split such a compound cluster than to prevent its formation in the first place, because the splitting can take the final cluster into account, whereas the merging process cannot. The algorithm uses a distance measure with three components: the number of paths between the two sub-clusters on the relatedness graph of length one (one links), the number of paths of length two (two links), and the average distance from base clusters in one sub-cluster to base clusters in the other sub-cluster.

$$\mathrm{dist}(c_1, c_2) = \frac{\mathit{onelinks} + 0.5 \cdot \mathit{twolinks}}{\mathrm{avgdist}(c_1, c_2)}$$

$$\mathrm{avgdist}(c_1, c_2) = \frac{\sum_{b_1 \in base(c_1)} \sum_{b_2 \in base(c_2)} \mathrm{len}(b_1, b_2)}{|base(c_1)| \cdot |base(c_2)|}$$

where len(b1, b2) is the path length between two base clusters in the relatedness graph.
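The graph operations behind Sections 3.2 and 3.3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the relatedness graph is assumed to be given already as an adjacency dictionary (the similarity test that creates its edges is not reproduced here), base clusters are modelled as sets of page identifiers, and all function names are hypothetical.

```python
import math
from collections import deque


def connected_components(graph):
    """Single-link merging: group the vertices of the relatedness graph
    by connected component. `graph` maps each vertex to its neighbours."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def cluster_pages(base_clusters):
    """Pages that occur in at least log2(|base(c)| + 1) of the base clusters."""
    threshold = math.log2(len(base_clusters) + 1)
    counts = {}
    for pages in base_clusters:
        for page in pages:
            counts[page] = counts.get(page, 0) + 1
    return {page for page, n in counts.items() if n >= threshold}


def path_length(graph, source, target):
    """len(b1, b2): number of edges on a shortest path (breadth-first search)."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distances[node]
        for neighbour in graph[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return math.inf  # unreachable; does not occur inside one merged cluster


def avgdist(graph, sub1, sub2):
    """Average path length between base clusters of two sub-clusters."""
    total = sum(path_length(graph, b1, b2) for b1 in sub1 for b2 in sub2)
    return total / (len(sub1) * len(sub2))


# Hypothetical relatedness graph over four base clusters, named by their words.
graph = {"cricket": {"bat"}, "bat": {"cricket", "ball"},
         "ball": {"bat"}, "film": set()}
print(connected_components(graph))
print(cluster_pages([{1, 2, 3}, {2, 3}, {3, 4}]))
print(avgdist(graph, {"cricket"}, {"bat", "ball"}))
```

In this toy graph "cricket", "bat" and "ball" merge into one cluster even though "cricket" and "ball" are not directly related, which is exactly the chaining effect that the splitting step is meant to detect through a large average path length.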

3.4 Selection

At this stage, the algorithm has a small set of coherent clusters. However, there will still be more clusters than can be presented to the user, so the algorithm needs to select the best subset of the clusters to present. Ideally, these should be high quality clusters that cover all the pages in the original set with minimal overlap. The algorithm uses the ESTC cluster selection algorithm [36] with an improved heuristic, H(C), to select the set of clusters to show the user. The ESTC cluster selection algorithm uses the heuristic with a 3-step look-ahead hill-climbing search. To evaluate a candidate set of clusters, C, the new heuristic considers the number of pages covered by the clusters (CP, counting a page once for every cluster that contains it), the number of distinct pages covered by the clusters (CD), the number of pages not covered by any of the clusters (CO), and the quality of each cluster (q(c)):

$$H(C) = \sum_{c \in C} q(c) - \alpha \, CO - \beta \, (CP - CD)$$

where $\alpha$ and $\beta$ are weighting constants.

3.5 Cleaning

Base clusters are sometimes formed from polysemous words, and therefore clusters can contain pages that cover different topics. Since a cluster should relate to only one topic, pages from other topics are irrelevant. The algorithm computes the relevance of each page in each cluster and removes irrelevant pages. The relevance of a page to a cluster is based on the number and size of the cluster's base clusters of which the page is a member. Page relevance varies between 0 and 1, with 0 indicating a page that is completely irrelevant to the cluster. Page relevance is computed as the sum of the sizes of the cluster's base clusters of which the page is a member, divided by the sum of the sizes of all of the cluster's base clusters:

$$\mathrm{relevance}(p, c) = \frac{\sum_{\{b \,\mid\, b \in base(c) \,\wedge\, p \in b\}} |b|}{\sum_{b \in base(c)} |b|}$$

4. MEASUREMENTS, EXPERIMENTS & RESULTS

This paper has presented a new web text mining model. This model is based on clustering.

4.1 Measurements

Clustering results have been evaluated using a wide variety of measurements.
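Returning briefly to the cleaning step of Section 3.5, the page relevance computation can be sketched in Python as follows. Base clusters are modelled as sets of page identifiers; the cut-off `min_relevance` and the function names are hypothetical, since the paper does not state the relevance value below which a page is removed.

```python
def page_relevance(page, base_clusters):
    """Sum of the sizes of the base clusters that contain the page,
    divided by the sum of the sizes of all base clusters of the cluster."""
    total = sum(len(b) for b in base_clusters)
    if total == 0:
        return 0.0
    member = sum(len(b) for b in base_clusters if page in b)
    return member / total


def clean_cluster(pages, base_clusters, min_relevance):
    """Keep only the pages whose relevance reaches the assumed cut-off."""
    return {p for p in pages
            if page_relevance(p, base_clusters) >= min_relevance}


# A cluster built from three base clusters of sizes 4, 2 and 1.
base = [{1, 2, 3, 4}, {2, 3}, {5}]
print(page_relevance(2, base))   # member of the two largest base clusters
print(page_relevance(5, base))   # member of the smallest base cluster only
print(clean_cluster({1, 2, 3, 4, 5}, base, min_relevance=0.5))
```

Page 5 belongs only to the smallest base cluster, so it receives the lowest relevance and is the first page to be dropped as the cut-off rises.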
Cluster purity is based on three standard information retrieval measures: precision, recall, and F-measure.

$$P(c, t) = \frac{|D_{c,t}|}{|D_c|}, \qquad R(c, t) = \frac{|D_{c,t}|}{|D_t|}, \qquad F(c, t) = \frac{2 \, P(c, t) \, R(c, t)}{P(c, t) + R(c, t)}$$

where
C is a set of clusters,
T is a set of topics,
D is a set of pages,
D_c is the set of pages in cluster c,
D_t is the set of pages in topic t,
D_{c,t} is the set of pages in cluster c that belong to topic t.

Purity assumes that a cluster represents the topic with the highest precision. F assumes that a cluster represents the topic with the highest F-measure. Entropy and mutual information are also used to measure cluster quality.

4.2 Experiments

Fig:- Precision & Recall
Fig:- Entropy & Mutual Information

4.3 Results

A. Graphical User Interface

The GUI consists of the following components.
1. A web links component that shows the list of all downloaded web pages.
2. A search button along with a text field.
3. A list of base clusters (Cluster List).
4. A table that contains the clusters related to the respective clusters.


5. A list of all URLs for a single cluster (Document List).
6. Four types of counts showing the total URLs searched (Total Links), the URLs for a single cluster (Document Links), and the time taken by the search engine as well as by QDC.
7. A Show Results button.

Fig:- Graphical User Interface

B. URLs for All downloaded web pages

The database for phrases consists of 100 web pages. Whenever the user enters a query phrase, the system checks the database for pages that contain the phrase and keeps all of them in the document list. The different interpretations of the phrase "sachin tendulkar cricket" are "sachin", "tendulkar", "cricket", "sachin tendulkar", and "sachin tendulkar cricket".

Fig:- URLs for All downloaded web pages

After the search process completes, a list of clusters, i.e. the different interpretations, is shown. The application shows the time taken by the search engine and by the algorithm. It also shows the total web pages searched and the filtered web pages that are related to the selected cluster.

C. Application showing cluster results

The application shows the list of web links related to the given query. Double-click a link shown in Web Links to see the Link Summary dialog.

Fig:- Application showing cluster results

The table on the right-hand side lists the clusters that relate to the given query. Double-click a row of the table to see the similarity value of that cluster for the given documents.

Fig:- Showing live web page which is related to searched phrase query
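The precision, recall and F-measure definitions of Section 4.1 can be sketched in Python as follows. Clusters and topics are modelled as sets of page identifiers; the function names and the example data are hypothetical and are not taken from the experiments.

```python
def precision(cluster, topic):
    """P(c, t) = |D_{c,t}| / |D_c|."""
    return len(cluster & topic) / len(cluster) if cluster else 0.0


def recall(cluster, topic):
    """R(c, t) = |D_{c,t}| / |D_t|."""
    return len(cluster & topic) / len(topic) if topic else 0.0


def f_measure(cluster, topic):
    """F(c, t) = 2 * P * R / (P + R), taken as 0 when P + R is 0."""
    p, r = precision(cluster, topic), recall(cluster, topic)
    return 2 * p * r / (p + r) if p + r else 0.0


def best_topic(cluster, topics, measure):
    """Name of the topic that maximises the given measure for the cluster."""
    return max(topics, key=lambda name: measure(cluster, topics[name]))


# One cluster of four pages evaluated against two hand-labelled topics.
cluster = {1, 2, 3, 4}
topics = {"cricketer": {1, 2, 3, 5, 6}, "film": {4, 7}}
print(precision(cluster, topics["cricketer"]))  # 0.75
print(recall(cluster, topics["cricketer"]))     # 0.6
print(best_topic(cluster, topics, precision))   # cricketer
```

Passing `precision` to `best_topic` gives the topic that purity assigns to the cluster, and passing `f_measure` gives the topic assumed by the F measurement.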

5. CONCLUSION

This paper has presented a new web text mining model. It combines a web page clustering algorithm with the VSM model and uses the relationship between clusters to show classification. The clustering algorithm has five key innovations. Firstly, it identifies better clusters using a query directed cluster quality guide that considers the relationship between a cluster's descriptive terms and the query terms. Secondly, it increases the merging of semantically related clusters and decreases the merging of semantically unrelated clusters by comparing the descriptions of clusters in addition to comparing the overlap of page contents between clusters. Thirdly, it fixes the cluster chaining (drifting) problem using a new cluster splitting method. Fourthly, it chooses better clusters to show the user by improving the ESTC cluster selection heuristic to consider the number of clusters to select and cluster quality. Finally, it improves the clusters by ranking the pages according to cluster relevance. A phrase can be given as the query to this model. Finally, the model shows the relationship between clusters in tree format.
