Automatic Web Page Categorization using Principal Component Analysis
Richong Zhang, Michael Shepherd, Jack Duffy, Carolyn Watters
Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada B3H 3L7

Abstract

Today's search engines retrieve tens of thousands of web pages in response to fairly simple query articulations. These pages are retrieved on the basis of the query terms occurring in the web pages and the popularity of the web pages as per the link structure of the web. However, these search engines do not take into account the broader information need of the user, such as the task in which the user is involved. This research investigates the automatic categorization of web pages using Principal Component Analysis. The research focuses on user tasks that involve searching for web pages containing health information, education information or shopping information. Initial results are encouraging, with recall and precision values slightly in excess of 80%.

1. Introduction

Search engines help us to retrieve web pages that satisfy the specifics of a user's query. The result set, however, is often so large that it is hard to find the page or pages that will actually satisfy the user's information need. One approach to increasing the relevance of the results is to categorize the results in anticipation of the user need. While this would not necessarily reduce the search engine effort, per se, it would allow the user to cope with the results of the search much faster and more accurately. Although search engines could provide this filtering, the categorization could also be done at the user's site on an as-needed basis. In this paper we discuss ongoing work to categorize web pages on the basis of automatic interpretation of the author's intent for three categories of web pages related to user task: pages about health information, shopping pages, and pages about education.
An approach based on Principal Component Analysis (PCA) has been investigated to develop a method to automatically classify web pages based on task. We use the metric Information Gain for feature selection in order to reduce the dimensionality of the document vectors. PCA is then used to map these document vector dimensions into a smaller dimensional space. Individual web documents are then projected into the new space for the purpose of classification. The results of a series of experiments show that this method is able to classify web pages efficiently for these three task categories. Section 2 of this paper briefly reviews related research, Section 3 describes the methodology of our research, Section 4 discusses the experimental results, and Section 5 summarizes the paper and discusses future research.

2. Related research

The need for methods of classifying web pages has been recognized for some time as a way to reduce the difficulty of users coping with very large search engine results. Prakash et al. [9] introduced a method to classify web pages based on document structure for university web pages. They proposed a method for the automatic classification of web pages into a few broad categories, including information pages, research pages, and personal home pages. Their method was based on the text content, images, links, videos and other structures of the web document. After testing about 4000 web pages from universities and other domains, 87.83% of the web pages were correctly categorized. Kan and Thi [7] also used a set of university web pages to show that they could be successfully classified using full text plus the uniform resource locator (URL). They took a subset of the WebKB corpus as the data set, and the web pages were classified into student, faculty, course and project pages using a Support Vector Machine (SVM) based on maximum entropy to define the feature set. Previous work has also addressed the issue of categorizing general web pages. Chekuri et al.
[2] automatically classified web pages into pre-specified YAHOO categories. They randomly selected 2000 web pages from 20 YAHOO categories and, after training the automatic classifier, they tested 500 new web pages from the same 20 YAHOO categories. They calculated the probability of a document being assigned to each category and ranked the pages by their probability. The result was that more than 50% of the test web pages were classified into the correct YAHOO category. Chakrabarti et al. [1] used text and hyperlink features together to build a web page classifier and found this method could significantly improve the accuracy. In addition to the content of the document, they included the classification of the neighbors in the evaluation of the class of each document. The inclusion of the close neighbors of the test document significantly boosted their classification accuracy, and they reported a 70% reduction in classification error compared to text-only classification. Shen and Chen [11] compared a web page summarization method with the traditional text classification method. They used a Naïve Bayes Classifier and a Support Vector Machine for the baseline classifications using the text content of the web pages. Their data set included 153,019 pages, distributed over 64 categories from the top two levels of the LookSmart Website. Their summarization method was based on Latent Semantic Analysis (LSA) using terms from the content of the web pages. Their results indicated that classifying web pages based on summaries produced by human editors was significantly better (a 13.2% improvement on the micro-F1 measure) than using only the text of the web pages. Experimental results also showed that their automatic summary process could achieve a similar improvement for classification (about a 12.9% improvement) [11] over text alone. Another recent approach is to compare web pages against all possible categories and place pages in the class with the highest probability. Peng and Choi [3] used class hierarchies to improve accuracy by about 6 percent over similar systems. In the research reported in this paper, our goal is to categorize web pages quickly into a small number of predefined categories, where the categories are user and user task dependent.
For example, when the user is shopping, we are not interested in identifying or discriminating among the other possible categories; we are only interested in quickly identifying shopping pages. Therefore, the system should quickly indicate whether or not a page is a member of the shopping category.

3. Methodology

The methodology followed in this research consisted of selecting a random set of web pages from selected YAHOO categories to form a data set, cleaning this data set, determining a set of features to represent the data set, building a document-term matrix, applying Principal Component Analysis to weight the features, categorizing the web pages in the test set, and evaluating the resulting categorization.

3.1 Dataset

The target classes were chosen, arbitrarily, to be Shopping, Health, and Education. As we wanted to generate a dataset consisting of web pages in these three categories, we looked to the YAHOO categories. YAHOO manually classifies web pages into a set of predefined categories (Figure 1). Therefore, we could randomly select a set of web pages, each of which had a known class. The final data set of 430 web pages included 120 web pages, selected randomly, from each of the YAHOO categories of Business & Economy > Shopping, Health, and Education. We also selected 70 web pages to represent noise, i.e., web pages that do not belong to the Shopping, Health or Education categories. These noise pages were selected randomly from the YAHOO categories of Auto Magazine, Calendar, Events, Young Adult, Art History, Election, Games, Sports News and Media, Weather, and Animals. All 430 web pages were examined by three raters to determine if there was agreement on the YAHOO-assigned categories. The noise web pages were also examined to confirm that they were, in fact, not Shopping, Health or Education pages. Web page selection continued until all three raters agreed on all 430 web pages.

Figure 1. YAHOO categories.
3.2 Data cleaning

The categorization approach in this research was based solely on content, i.e., key words. Therefore, it was necessary to remove all HTML tags and images, etc., from all of the web pages. All remaining words were converted to lowercase, stop words [12] were removed, and the remaining words were stemmed with Porter's algorithm [8]. This resulted in 10,985 unique word stems.

3.3 Feature selection

Feature selection, widely used in pattern recognition and data mining, selects a set of features based on some criteria such that the resulting smaller set has a high representational capability. Feature selection reduces the number of features, in this case keywords, needed for processing, such that processing time is reduced. In our case the initial feature set consisted of all 10,985 unique word stems. Information Gain (IG) [14, 15], an information theoretic measure, was used to rank the features so that a threshold could be established above which the features were selected for the reduced set of features. The IG measure is based on the entropy associated with a feature (word stem) with respect to its ability to correctly predict the category in which a given document occurs. It is given by:

IG(t) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{m} P(c_i|t) \log P(c_i|t) + P(\bar{t}) \sum_{i=1}^{m} P(c_i|\bar{t}) \log P(c_i|\bar{t})

Where:
- P(c_i) is the probability of a document of class i occurring
- P(c_i|t) is the probability of a document of class i occurring given that the document contains term t
- P(c_i|\bar{t}) is the probability of a document of class i occurring given that the document does not contain term t

The IG was calculated for each term in the term set derived from the web pages from the Shopping, Health and Education categories after the data cleaning process. Table 1 shows the top 20 features (word stems) as determined by the IG values (shown truncated to 3 decimal places).
Also shown are the number of web pages in each of the three categories in which each feature occurs.

Table 1. Top 20 features by IG value: Educ, Diseas, Medic, Health, Teacher, School, Price, Item, Ship, Student, Custom, Accessori, Cancer, Doctor, Public, Shop, Heart, Cart, Medicin, Physician.

Once each of the 10,985 features was assigned an IG value, it was possible to select the best or most discriminating features. Figure 2 shows a plot of the IG values in descending order. The Y-axis represents the IG value and the X-axis the rank of the features. As can be seen, there is a rapid decrease in the initial set of IG values before the curve starts to flatten out, representing much smaller differences among those features with respect to their ability to differentiate among the classes. After some experimentation, the threshold was established by using the features with the top 300 IG values as the final feature set.
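The IG measure above can be sketched in plain Python. This is an illustrative implementation of the formula, not the authors' code; the names `information_gain`, `docs` and `labels` are hypothetical:

```python
import math
from collections import Counter

def information_gain(docs, labels, term):
    """Information gain of `term` over a labelled document collection.

    docs   -- list of sets of word stems (one set per document)
    labels -- list of class labels, parallel to docs
    """
    n = len(docs)

    def entropy(counts, total):
        # -sum_i P(c_i) log2 P(c_i) over the class distribution `counts`
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values() if c)

    # Entropy of the class distribution: -sum_i P(c_i) log P(c_i)
    h_class = entropy(Counter(labels), n)

    # Split documents by presence/absence of the term
    present = [lab for d, lab in zip(docs, labels) if term in d]
    absent = [lab for d, lab in zip(docs, labels) if term not in d]

    # Conditional entropy, weighted by P(t) and P(t-bar)
    h_cond = sum((len(part) / n) * entropy(Counter(part), len(part))
                 for part in (present, absent) if part)

    return h_class - h_cond

# Toy collection: "cancer" occurs only in health pages, so it perfectly
# predicts the class and its IG equals the class entropy.
docs = [{"cancer", "doctor"}, {"cart", "price"}, {"cancer"}, {"ship"}]
labels = ["health", "shopping", "health", "shopping"]
print(information_gain(docs, labels, "cancer"))  # 1.0
```

Ranking all stems by this value and keeping the top 300 yields the reduced feature set described above.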
Figure 2. Plot of IG values.

3.4 Principal Component Analysis

The reduced feature set was used to create a document-term matrix to represent the 360 documents from the Shopping, Health and Education classes. The resulting 360 x 300 matrix represents each web page as a vector of 300 columns. The values in the document-term matrix are binary, representing the simple occurrence or non-occurrence of that feature in that web page. The tf.idf [10] weighting scheme was also investigated, but the results were not significantly different from those using binary weights. Principal Component Analysis (PCA) was applied to this binary matrix to determine if we could distinguish among the three categories using this data.

Principal Components Analysis [13] is a technique for simplifying a dataset. It is a linear transformation of the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA can be used for dimensionality reduction in a dataset, while retaining those characteristics of the dataset that contribute most to its variance, by keeping lower-order principal components and ignoring higher-order ones. Such low-order components often contain the most important aspects of the data. One advantage of PCA, important to this research, is that once the patterns in the data have been found, the number of dimensions may be reduced without much subsequent loss of information. PCA is similar to Latent Semantic Indexing [4] in that the dimensionality is reduced and then the terms and documents can be projected into this reduced space.

In applying PCA, we calculated the covariance matrix of the feature set that had been reduced using Information Gain, and then calculated the eigenvalues and eigenvectors of this covariance matrix. The largest eigenvalue identifies the eigenvector that expresses the most significant relationship among the data dimensions. This eigenvector is the first principal component, which is then chosen as the most significant component. The results of the PCA analysis of our reduced document-term matrix indicated that the first three eigenvectors carry most of the information (Figure 3).

Figure 3. PCA-generated eigenvalues (the first 3 eigenvectors carry most of the information).

Consequently, we used the PCA results to project the 360 x 300 document-term matrix into a 360 x 3 matrix. The resulting 3-dimensional graph (Figure 4) shows that the Shopping, Health and Education web pages do appear to cluster along these three principal components (eigenvectors). The circles represent the Shopping web pages, the stars the Health web pages, and the plus signs represent the Education web pages.

3.5 Decision tree categorization

After the PCA, each web page was represented in the reduced vector space by the coordinates of the three eigenvectors. The c4.5 decision tree package [6] was then used to analyze the resulting projection and to extract a set of rules for the decision tree. This decision tree was then used to classify new web pages into one of the three categories and a NOISE category.
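The covariance-and-eigenvector procedure of Section 3.4 can be sketched in NumPy. This is a minimal illustration, not the authors' implementation; a random binary matrix stands in for the actual 360 x 300 document-term matrix:

```python
import numpy as np

# Stand-in for the 360 x 300 binary document-term matrix
# (random here; the paper's matrix comes from the IG-selected stems).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(360, 300)).astype(float)

# Covariance matrix over the 300 feature dimensions (mean-centre first)
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)                 # 300 x 300

# Eigenvalues/eigenvectors of the symmetric covariance matrix;
# np.linalg.eigh returns eigenvalues in ascending order, so re-sort.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]              # descending by variance
top3 = eigvecs[:, order[:3]]                   # 300 x 3 projection matrix

# Project each web page vector into the 3-dimensional PCA space
projected = Xc @ top3
print(projected.shape)                         # (360, 3)
```

In practice a library PCA routine (or an SVD of the centred matrix) would give the same projection; the explicit eigendecomposition is shown only to mirror the steps described in the text.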
Figure 4. PCA plot of web pages.

3.6 Experimental procedure

The 10-fold cross-validation process was used for the evaluation experiments. This process is often used to give statistical validity to situations where the data sets are small. In this process, the 430 web pages, consisting of 120 Shopping, 120 Health, 120 Education, and 70 not-Shopping-Health-Education (i.e., noise) web pages, were divided randomly into 10 equal partitions. The web pages assigned to each partition reflected the distribution of the categories of web pages over the entire data set. One partition was held out as the test set and the other nine partitions formed the training set. The classifier, in this case the PCA projection matrix, was trained on the training set and tested on the test set. This was repeated for a total of 10 evaluations, each partition being held out in turn as the test set. The results for all of the iterations were then averaged to give the final results. The following steps were followed for each iteration of the 10-fold cross-validation process in our experiment:

- For the training set, Information Gain was calculated for terms in the known Shopping, Health and Education categories.
- Features were selected from the top 300 Information Gain values to create the reduced feature set.
- The document-term matrix was generated for all web pages in the training set, including the noise category of web pages.
- PCA was applied to the matrix and only the eigenvectors associated with the largest three eigenvalues were kept.
- The c4.5 decision tree was run on the projected matrix to determine the rules for building a decision tree, and the tree was generated.
- The test set of data was projected into the PCA eigenvector space.
- The decision tree was applied to the projected test set to categorize each web page of the test set as Shopping, Health, Education or noise.

4. Results and discussion

The results of the 10-fold cross-validation process are presented in the confusion matrix of Table 2. A confusion matrix presents a view of both correct and incorrect classifications. The rows represent the original or correct categories and the columns represent the assigned categories. The distribution of web pages in each of the 10 partitions was 12 web pages of each of Health, Shopping and Education, and 7 web pages of noise. A perfect system would have values only on the diagonal. Each cell of Table 2 shows the average number of web pages of the original category assigned to that target category, with the standard deviation in parentheses. For instance, the average number of health web pages assigned to the health category over the 10 iterations was 10.0, with a standard deviation of 1.15. The average number of health web pages (incorrectly) assigned to the shopping category was 0.8, with a standard deviation of 0.92. A perfect system would have assigned 12 web pages and 0 web pages, respectively, to these two categories.

Table 2. Confusion matrix for test data (rows: original categories; columns: assigned categories)

            Health        Shopping      Education     Noise
Health      10.0 (1.15)   0.8 (0.92)    0.9 (0.57)    0.6 (0.70)
Shopping    0.8 (1.13)    9.9 (1.60)    0.1 (0.32)    1.1 (1.66)
Education   0.5 (0.70)    0.1 (0.32)    9.2 (0.63)    1.6 (1.51)
Noise       0.71 (1.06)   1.2 (1.03)    1.8 (0.79)    3.7 (1.25)

A Chi-Square analysis of this confusion matrix found that the distribution was significant at p = 0.001 (df = 9, χ² = 59.00). These categorization results were evaluated using the recall and precision measures. Recall is the proportion of web pages that should be in a particular category that are correctly assigned to that category. Precision is the proportion of web pages that are assigned to a particular category that should be in that category.
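Recall and precision as defined here can be read directly off the confusion matrix. A short sketch using the averaged cell values from Table 2 (since the cells are 10-fold averages, the derived percentages are approximate):

```python
import numpy as np

# Averaged confusion matrix from Table 2
# (rows = original category, columns = assigned category)
categories = ["Health", "Shopping", "Education", "Noise"]
cm = np.array([
    [10.0, 0.8, 0.9, 0.6],
    [0.8, 9.9, 0.1, 1.1],
    [0.5, 0.1, 9.2, 1.6],
    [0.71, 1.2, 1.8, 3.7],
])

for i, name in enumerate(categories):
    recall = cm[i, i] / cm[i, :].sum()     # correct / all truly in the class
    precision = cm[i, i] / cm[:, i].sum()  # correct / all assigned to the class
    print(f"{name}: recall={recall:.2f}, precision={precision:.2f}")
```

Run on these averages, the three target classes come out at roughly 80% recall and precision and the noise class at roughly 50%, consistent with the figures reported in the text.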
The recall and precision results are shown in Table 3.

Table 3. Recall and precision values by category (Shopping, Health, Education, Noise).

The overall results for web pages assigned to the Shopping, Health and Education classes are approximately 80% for both recall and precision. Note that recall and precision for the noise class are approximately 50%, indicating that the classifier was only partially able to separate the noise web pages, drawn from classes on which it had not been trained, from the three target classes. Of particular interest is the analysis of the 3-dimensional plot of the projected data including the noise category (Figure 5). The area around the origin in Figure 5, as indicated by the circle, contains the majority of the noise web pages plus those web pages that were weakly classified. These were web pages that had few, if any, of the 300 features representing the original data set (for example, the web page shown in Figure 6). Those web pages represented by positions that lie on a principal component axis and distant from the origin, such as the point labeled Strong Health (for example, the web page shown in Figure 7), were those web pages that contained a large number of features from a single category and very few from any other category. Those web pages represented by positions that appear to be distant from the origin but do not lie on a principal component axis tended to be those web pages that could be classified into more than one class. These web pages contained features that represent more than a single category. An example in Figure 5 is the point labeled Health with Shopping (for example, the web page shown in Figure 8). The web page shown in Figure 8 was classed by YAHOO and the three raters as Health but was in fact about health products for sale, and thus contained a number of features that represented the Shopping category.
Figure 5. PCA plot with noise data.

Figure 6. Web page weakly classified as health.
Figure 7. Web page strongly classified as health.

Figure 8. Web page classified as health but with some shopping characteristics.
5. Summary and future research

In summary, this investigation into automatic web page categorization using Principal Component Analysis has shown promising results. Both recall and precision were slightly over 80%. However, there are a number of limitations and areas that will require further research. In particular, these include the issues of scale, feature set selection, classification of web pages that may belong to two or more categories, and the recognition of new classes. On the issue of scale, the web is huge and we will have to increase both the number of target categories and the number of web pages classified. Although our results are statistically valid, they cannot be used to infer that this approach will produce similar results for much larger data sets or for all categories. The feature set selection is also important. In this initial research we used only the content words and ignored other features, such as HTML tags and links, which have given other researchers good results. Since authorship of web pages is out of the control of search engines and users, it is not surprising that many web pages could reasonably belong to two or more classes or categories. Although from Figure 5 it would appear that the three categories are well separated (and thus our good results), there was some amount of overlap among the sets. This overlap is exemplified by such web pages as those about health but with some shopping characteristics, and education pages that led to health education programs, etc. Refined decision tree analysis may be able to recognize when this occurs, correctly identify the categories involved, and assign web pages to multiple categories. Our initial results indicate that we are able to recognize web pages that do not belong to any of the target classes with an accuracy of only about 50%.
This may happen when noise pages are somewhat similar to a known class, and may also occur when the features of known classes begin to change as the web continues to evolve. Research is also needed to recognize when a new class has developed, either as a novel class or as a derivative of an existing class. It is expected that increasing the proportion of noise web pages will adversely affect the precision of the classifier. This will be addressed in future research, with particular attention to tuning the decision tree rules. Our approach in this research has been feature set selection for the target categories, principal component analysis to reduce the dimensionality and to determine the principal components and, finally, the development and application of a decision tree to categorize the test web pages once projected into the reduced space. Further research will evaluate this approach as compared to other categorization approaches. In particular, we will evaluate our approach against unsupervised learning techniques such as K-means clustering. Ding and He [5] have shown that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. By selecting features only from the three target categories we have biased our principal components to reflect these target categories. Similarly, this same feature set should bias the content of the clusters formed in unsupervised learning approaches such as K-means clustering. The fact that principal components are continuous solutions accurately reflects the fuzzy nature of some of the web pages, e.g., health pages with some shopping characteristics, and thus the need for the decision tree. Our ultimate goal is to be able to provide sets of web pages with features that reflect the user's task. Such automatic categorization should be able to help the user cope with larger and larger web search query results.

6. References

[1] Chakrabarti, S., Dom, B. and P. Indyk. Enhanced Hypertext Categorization Using Hyperlinks. Proceedings of the ACM SIGMOD International Conference on Management of Data.
[2] Chekuri, C., Goldwasser, M.H., Prabhakar, R. and E. Upfal. Web Search Using Automatic Classification. Proceedings of WWW-96, 6th International Conference on the World Wide Web, 1996.
[3] Choi, B. and X. Peng. Dynamic and Hierarchical Classification of Web Pages. Online Information Review, Volume 28, Number 2, 2004.
[4] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, Vol. 41, No. 6, 1990.
[5] Ding, C. and X. He. K-means Clustering via Principal Component Analysis. Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.
[6] Han, J. and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc.
[7] Kan, M.-Y. and H.O.N. Thi. Fast Webpage Classification Using URL Features. Proceedings of the Conference on Information and Knowledge Management (CIKM '05), Bremen, Germany, November 2005.
[8] The Porter Stemming Algorithm.
[9] Prakash, A., Kranthi, A. and R. Kumar. Web Page Classification Based on Document Structure. 2001, pcds.pdf.
[10] Salton, G. Automatic Text Processing. McGraw-Hill Book Company.
[11] Shen, D., Chen, Z., Zeng, H.-J., Zhang, B., Yang, Q., Ma, W.-Y. and Y. Lu. Web-Page Classification through Summarization. Proceedings of the Special Interest Group in Information Retrieval.
[12] SMART FTP site: ftp://ftp.cs.cornell.edu/pub/smart/
[13] Wikipedia. Principal Component Analysis.
[14] Yang, Y. and J.O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proceedings of ICML-97, 14th International Conference on Machine Learning.
[15] Zheng, Z., Wu, X. and R.K. Srihari. Feature Selection for Text Categorization on Imbalanced Data. ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1 (June 2004).
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationCS145: INTRODUCTION TO DATA MINING
CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data
More informationUnsupervised learning in Vision
Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual
More informationAn Ensemble Approach to Enhance Performance of Webpage Classification
An Ensemble Approach to Enhance Performance of Webpage Classification Roshani Choudhary 1, Jagdish Raikwal 2 1, 2 Dept. of Information Technology 1, 2 Institute of Engineering & Technology 1, 2 DAVV Indore,
More informationA Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)
A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) Eman Abdu eha90@aol.com Graduate Center The City University of New York Douglas Salane dsalane@jjay.cuny.edu Center
More informationClustering Documents in Large Text Corpora
Clustering Documents in Large Text Corpora Bin He Faculty of Computer Science Dalhousie University Halifax, Canada B3H 1W5 bhe@cs.dal.ca http://www.cs.dal.ca/ bhe Yongzheng Zhang Faculty of Computer Science
More informationAnomaly Detection on Data Streams with High Dimensional Data Environment
Anomaly Detection on Data Streams with High Dimensional Data Environment Mr. D. Gokul Prasath 1, Dr. R. Sivaraj, M.E, Ph.D., 2 Department of CSE, Velalar College of Engineering & Technology, Erode 1 Assistant
More informationSprinkled Latent Semantic Indexing for Text Classification with Background Knowledge
Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge Haiqin Yang and Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,
More informationFeature Selection Using Modified-MCA Based Scoring Metric for Classification
2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification
More informationBest First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis
Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction
More informationNews Filtering and Summarization System Architecture for Recognition and Summarization of News Pages
Bonfring International Journal of Data Mining, Vol. 7, No. 2, May 2017 11 News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bamber and Micah Jason Abstract---
More informationClassification Algorithms in Data Mining
August 9th, 2016 Suhas Mallesh Yash Thakkar Ashok Choudhary CIS660 Data Mining and Big Data Processing -Dr. Sunnie S. Chung Classification Algorithms in Data Mining Deciding on the classification algorithms
More informationApplication of Support Vector Machine Algorithm in Spam Filtering
Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationInformation-Theoretic Feature Selection Algorithms for Text Classification
Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute
More informationLinear Discriminant Analysis for 3D Face Recognition System
Linear Discriminant Analysis for 3D Face Recognition System 3.1 Introduction Face recognition and verification have been at the top of the research agenda of the computer vision community in recent times.
More informationANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining
ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, 2014 ISSN 2278 5485 EISSN 2278 5477 discovery Science Comparative Study of Classification Algorithms Using Data Mining Akhila
More informationFeature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani
Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate
More information2. Data Preprocessing
2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459
More informationInternational Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at
Performance Evaluation of Ensemble Method Based Outlier Detection Algorithm Priya. M 1, M. Karthikeyan 2 Department of Computer and Information Science, Annamalai University, Annamalai Nagar, Tamil Nadu,
More informationWeka ( )
Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised
More informationCIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]
CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, 2015. 11:59pm, PDF to Canvas [100 points] Instructions. Please write up your responses to the following problems clearly and concisely.
More informationCS249: ADVANCED DATA MINING
CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal
More informationIn this project, I examined methods to classify a corpus of s by their content in order to suggest text blocks for semi-automatic replies.
December 13, 2006 IS256: Applied Natural Language Processing Final Project Email classification for semi-automated reply generation HANNES HESSE mail 2056 Emerson Street Berkeley, CA 94703 phone 1 (510)
More informationPATTERN RECOGNITION USING NEURAL NETWORKS
PATTERN RECOGNITION USING NEURAL NETWORKS Santaji Ghorpade 1, Jayshree Ghorpade 2 and Shamla Mantri 3 1 Department of Information Technology Engineering, Pune University, India santaji_11jan@yahoo.co.in,
More informationDatabase and Knowledge-Base Systems: Data Mining. Martin Ester
Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro
More informationECE 285 Class Project Report
ECE 285 Class Project Report Based on Source localization in an ocean waveguide using supervised machine learning Yiwen Gong ( yig122@eng.ucsd.edu), Yu Chai( yuc385@eng.ucsd.edu ), Yifeng Bu( ybu@eng.ucsd.edu
More informationAutomated Online News Classification with Personalization
Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798
More informationA Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics
A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au
More informationA New Technique to Optimize User s Browsing Session using Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
More informationClassification. 1 o Semestre 2007/2008
Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems
More informationFeature selection. LING 572 Fei Xia
Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection
More informationA Comparative Study of Selected Classification Algorithms of Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220
More informationSlides for Data Mining by I. H. Witten and E. Frank
Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-
More informationKeyword Extraction by KNN considering Similarity among Features
64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,
More informationEnhancing Cluster Quality by Using User Browsing Time
Enhancing Cluster Quality by Using User Browsing Time Rehab Duwairi Dept. of Computer Information Systems Jordan Univ. of Sc. and Technology Irbid, Jordan rehab@just.edu.jo Khaleifah Al.jada' Dept. of
More informationProbabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation
Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation Daniel Lowd January 14, 2004 1 Introduction Probabilistic models have shown increasing popularity
More informationIndividualized Error Estimation for Classification and Regression Models
Individualized Error Estimation for Classification and Regression Models Krisztian Buza, Alexandros Nanopoulos, Lars Schmidt-Thieme Abstract Estimating the error of classification and regression models
More informationIntrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN
Intrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN Presentation Overview - Background - Preprocessing - Data Mining Methods to Determine Outliers - Finding Outliers - Outlier Validation -Summary
More information7. Mining Text and Web Data
7. Mining Text and Web Data Contents of this Chapter 7.1 Introduction 7.2 Data Preprocessing 7.3 Text and Web Clustering 7.4 Text and Web Classification 7.5 References [Han & Kamber 2006, Sections 10.4
More informationIdentifying Layout Classes for Mathematical Symbols Using Layout Context
Rochester Institute of Technology RIT Scholar Works Articles 2009 Identifying Layout Classes for Mathematical Symbols Using Layout Context Ling Ouyang Rochester Institute of Technology Richard Zanibbi
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationAN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS
AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS H.S Behera Department of Computer Science and Engineering, Veer Surendra Sai University
More informationHeterogeneous Graph-Based Intent Learning with Queries, Web Pages and Wikipedia Concepts
Heterogeneous Graph-Based Intent Learning with Queries, Web Pages and Wikipedia Concepts Xiang Ren, Yujing Wang, Xiao Yu, Jun Yan, Zheng Chen, Jiawei Han University of Illinois, at Urbana Champaign MicrosoD
More informationExploiting Index Pruning Methods for Clustering XML Collections
Exploiting Index Pruning Methods for Clustering XML Collections Ismail Sengor Altingovde, Duygu Atilgan and Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey {ismaila,
More informationGeneral Instructions. Questions
CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These
More informationFast or furious? - User analysis of SF Express Inc
CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood
More informationFacial Expression Recognition using Principal Component Analysis with Singular Value Decomposition
ISSN: 2321-7782 (Online) Volume 1, Issue 6, November 2013 International Journal of Advance Research in Computer Science and Management Studies Research Paper Available online at: www.ijarcsms.com Facial
More informationKEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method.
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY IMPROVED ROUGH FUZZY POSSIBILISTIC C-MEANS (RFPCM) CLUSTERING ALGORITHM FOR MARKET DATA T.Buvana*, Dr.P.krishnakumari *Research
More informationhighest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate
Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California
More informationA Review on Identifying the Main Content From Web Pages
A Review on Identifying the Main Content From Web Pages Madhura R. Kaddu 1, Dr. R. B. Kulkarni 2 1, 2 Department of Computer Scienece and Engineering, Walchand Institute of Technology, Solapur University,
More informationDesigning and Building an Automatic Information Retrieval System for Handling the Arabic Data
American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far
More informationAccelerometer Gesture Recognition
Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate
More informationCHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES
70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically
More informationDimension Reduction CS534
Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of
More informationEnhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques
24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE
More informationInfluence of Word Normalization on Text Classification
Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we
More informationDomain-specific Concept-based Information Retrieval System
Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical
More informationDimension reduction : PCA and Clustering
Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental
More informationEnhancing Cluster Quality by Using User Browsing Time
Enhancing Cluster Quality by Using User Browsing Time Rehab M. Duwairi* and Khaleifah Al.jada'** * Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110,
More information2. Design Methodology
Content-aware Email Multiclass Classification Categorize Emails According to Senders Liwei Wang, Li Du s Abstract People nowadays are overwhelmed by tons of coming emails everyday at work or in their daily
More informationCOMP 465: Data Mining Classification Basics
Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised
More informationSCUBA DIVER: SUBSPACE CLUSTERING OF WEB SEARCH RESULTS
SCUBA DIVER: SUBSPACE CLUSTERING OF WEB SEARCH RESULTS Fatih Gelgi, Srinivas Vadrevu, Hasan Davulcu Department of Computer Science and Engineering, Arizona State University, Tempe, AZ fagelgi@asu.edu,
More informationSNS College of Technology, Coimbatore, India
Support Vector Machine: An efficient classifier for Method Level Bug Prediction using Information Gain 1 M.Vaijayanthi and 2 M. Nithya, 1,2 Assistant Professor, Department of Computer Science and Engineering,
More informationResearch on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a
International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,
More informationGeneric Face Alignment Using an Improved Active Shape Model
Generic Face Alignment Using an Improved Active Shape Model Liting Wang, Xiaoqing Ding, Chi Fang Electronic Engineering Department, Tsinghua University, Beijing, China {wanglt, dxq, fangchi} @ocrserv.ee.tsinghua.edu.cn
More informationMODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS
MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500
More informationData Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining
Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Data Preprocessing Aggregation Sampling Dimensionality Reduction Feature subset selection Feature creation
More informationPredicting Popular Xbox games based on Search Queries of Users
1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which
More informationMSA220 - Statistical Learning for Big Data
MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups
More informationTwo-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California
Two-Dimensional Visualization for Internet Resource Discovery Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California 90089-0781 fshli, danzigg@cs.usc.edu
More informationLecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic
SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association
More information