Automatic Web Page Categorization using Principal Component Analysis
Richong Zhang, Michael Shepherd, Jack Duffy, Carolyn Watters
Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada B3H 3L7

Abstract

Today's search engines retrieve tens of thousands of web pages in response to fairly simple query articulations. These pages are retrieved on the basis of the query terms occurring in the web pages and the popularity of the web pages as per the link structure of the web. However, these search engines do not take into account the broader information need of the user, such as the task in which the user is involved. This research investigates the automatic categorization of web pages using Principal Component Analysis. The research focuses on user tasks that involve searching for web pages containing health information, education information or shopping information. Initial results are encouraging, with recall and precision values slightly in excess of 80%.

1. Introduction

Search engines help us to retrieve web pages that satisfy the specifics of a user's query. The result set, however, is often so large that it is hard to find the page or pages that will actually satisfy the user's information need. One approach to increasing the relevance of the results is to categorize the results in anticipation of the user need. While this would not necessarily reduce the search engine effort, per se, it would allow the user to cope with the results of the search much faster and more accurately. Although search engines could provide this filtering, the categorization could also be done at the user's site on an as-needed basis. In this paper we discuss ongoing work to categorize web pages on the basis of automatic interpretation of the author's intent for three categories of web pages related to user task: pages about health information, shopping pages, and pages about education.
An approach based on Principal Component Analysis (PCA) has been investigated to develop a method to automatically classify web pages based on task. We use the metric Information Gain for feature selection in order to reduce the dimensionality of the document vectors. PCA is then used to map these document vector dimensions into a smaller dimensional space. Individual web documents are then projected into the new space for the purpose of classification. The results of a series of experiments show that this method is able to classify web pages efficiently for these three task categories. Section 2 of this paper briefly reviews related research, Section 3 describes the methodology of our research, Section 4 discusses the experimental results, and Section 5 summarizes the paper and discusses future research.

2. Related research

The need for methods of classifying web pages has been recognized for some time as a way to reduce the difficulty of users coping with very large search engine results. Prakash et al. [9] introduced a method to classify web pages based on document structure for university web pages. They proposed a method for the automatic classification of web pages into a few broad categories, including information pages, research pages, and personal home pages. Their method was based on the text content, images, links, videos and other structures of the web document. After testing about 4000 web pages from universities and other domains, 87.83% of the web pages were correctly categorized. Kan and Thi [7] also used a set of university web pages to show that they could be successfully classified using full text plus the uniform resource locator (URL). They took a subset of the WebKB corpus as the data set, and the web pages were classified into student, faculty, course and project pages using a Support Vector Machine (SVM) based on maximum entropy to define the feature set. Previous work has also addressed the issue of categorizing general web pages. Chekuri et al.
[2] automatically classified web pages into pre-specified YAHOO categories. They randomly selected 2000 web pages from 20 YAHOO categories and, after training the automatic classifier, they tested 500 new web pages from the same 20 YAHOO categories. They calculated the probability of a document being assigned to each category and ranked the pages by their probability. The result was that more than 50% of the test web pages were classified into the correct YAHOO category. Chakrabarti et al. [1] used text and hyperlink features together to build a web page classifier and found this method could significantly improve the accuracy. In addition to the content of the document, they included the classification of the neighbors in the evaluation of the class of each document. The inclusion of the close neighbors of the test document significantly boosted their classification accuracy, and they reported a 70% reduction in classification error compared to text-only classification. Shen and Chen [11] compared a web page summarization method with the traditional text classification method. They used a Naïve Bayes Classifier and a Support Vector Machine for the baseline classifications using the text content of the web pages. Their data set included 153,019 pages, distributed over 64 categories from the top two levels of the LookSmart Website. Their summarization method was based on Latent Semantic Analysis (LSA) using terms from the content of the web pages. Their results indicated that classifying web pages based on summaries produced by human editors was significantly better (a 13.2% improvement on the micro-F1 measure) than using only the text of the web pages. Experimental results also showed that their automatic summary process could achieve a similar improvement for classification (about a 12.9% improvement) [11] over text alone. Another recent approach is to compare web pages against all possible categories and place pages in the class with the highest probability. Peng and Choi [3] used class hierarchies to improve accuracy by about 6 percent over similar systems. In the research reported in this paper, our goal is to categorize web pages quickly into a small number of predefined categories, where the categories are user and user task dependent.
For example, when the user is shopping, we are not interested in identifying or discriminating among the other possible categories; we are only interested in quickly identifying shopping pages. Therefore, the system should quickly indicate whether or not a page is a member of the shopping category.

3. Methodology

The methodology followed in this research consisted of selecting a random set of web pages from selected YAHOO categories to form a data set, cleaning this data set, determining a set of features to represent the data set, building a document-term matrix, applying Principal Component Analysis to weight the features, categorizing the web pages in the test set, and evaluating the resulting categorization.

3.1 Dataset

The target classes were chosen, arbitrarily, to be Shopping, Health, and Education. As we wanted to generate a dataset consisting of web pages in these three categories, we looked to the YAHOO categories. YAHOO manually classifies web pages into a set of predefined categories (Figure 1). Therefore, we could randomly select a set of web pages, each of which had a known class. The final data set of 430 web pages included 120 web pages, selected randomly, from each of the YAHOO categories of Business & Economy > Shopping, Health, and Education. We also selected 70 web pages to represent noise, i.e., web pages that do not belong to the Shopping, Health or Education categories. These noise pages were selected randomly from the YAHOO categories of Auto Magazine, Calendar, Events, Young Adult, Art History, Election, Games, Sports News and Media, Weather, and Animals. All 430 web pages were examined by three raters to determine if there was agreement on the YAHOO-assigned categories. The noise web pages were also examined to confirm that they were, in fact, not Shopping, Health or Education pages. Web page selection continued until all three raters agreed on all 430 web pages.

Figure 1. YAHOO categories.
3.2 Data cleaning

The categorization approach in this research was based solely on content, i.e., key words. Therefore, it was necessary to remove all HTML tags and images, etc., from all of the web pages. All remaining words were converted to lowercase, stop words [12] were removed, and the remaining words were stemmed with Porter's algorithm [8]. This resulted in 10,985 unique word stems.

3.3 Feature selection

Feature selection, widely used in pattern recognition and data mining, selects a set of features based on some criteria such that the resulting smaller set has a high representational capability. Feature selection reduces the number of features, in this case keywords, needed for processing, such that processing time is reduced. In our case the initial feature set consisted of all 10,985 unique word stems. Information Gain (IG) [14, 15], an information theoretic measure, was used to rank the features so that a threshold could be established above which the features were selected for the reduced set of features. The IG measure is based on the entropy associated with a feature (word stem) with respect to its ability to correctly predict the category in which a given document occurs. It is given by:

IG(t) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{m} P(c_i|t) \log P(c_i|t) + P(\bar{t}) \sum_{i=1}^{m} P(c_i|\bar{t}) \log P(c_i|\bar{t})

Where:
- P(c_i) is the probability of a document of class i occurring
- P(c_i|t) is the probability of a document of class i occurring given that the document contains term t
- P(c_i|\bar{t}) is the probability of a document of class i occurring given that the document does not contain term t

The IG was calculated for each term in the term set derived from the web pages from the Shopping, Health and Education categories after the data cleaning process. Table 1 shows the top 20 features (word stems) as determined by the IG values (shown truncated to 3 decimal places).
Also shown are the number of web pages in each of the three categories in which each feature occurs.

Table 1. Top 20 features by IG value: Educ, Diseas, Medic, Health, Teacher, School, Price, Item, Ship, Student, Custom, Accessori, Cancer, Doctor, Public, Shop, Heart, Cart, Medicin, Physician.

Once each of the 10,985 features was assigned an IG value, it was possible to select the best or most discriminating features. Figure 2 shows a plot of the IG values in descending order. The Y-axis represents the IG value and the X-axis the rank of the features. As can be seen, there is a rapid decrease in the initial set of IG values before the curve starts to flatten out, representing much smaller differences among those features with respect to their ability to differentiate among the classes. After some experimentation, the threshold was established by using the features with the top 300 IG values as the final feature set.
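The IG measure above can be sketched in plain Python. This is an illustrative implementation of the formula, not the authors' code; the names `information_gain`, `docs` and `labels` are hypothetical:

```python
import math
from collections import Counter

def information_gain(docs, labels, term):
    """Information gain of `term` over a labelled document collection.

    docs   -- list of sets of word stems (one set per document)
    labels -- list of class labels, parallel to docs
    """
    n = len(docs)

    def entropy(counts, total):
        # -sum_i P(c_i) log2 P(c_i) over the class distribution `counts`
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values() if c)

    # Entropy of the class distribution: -sum_i P(c_i) log P(c_i)
    h_class = entropy(Counter(labels), n)

    # Split documents by presence/absence of the term
    present = [lab for d, lab in zip(docs, labels) if term in d]
    absent = [lab for d, lab in zip(docs, labels) if term not in d]

    # Conditional entropy, weighted by P(t) and P(t-bar)
    h_cond = sum((len(part) / n) * entropy(Counter(part), len(part))
                 for part in (present, absent) if part)

    return h_class - h_cond

# Toy collection: "cancer" occurs only in health pages, so it perfectly
# predicts the class and its IG equals the class entropy.
docs = [{"cancer", "doctor"}, {"cart", "price"}, {"cancer"}, {"ship"}]
labels = ["health", "shopping", "health", "shopping"]
print(information_gain(docs, labels, "cancer"))  # 1.0
```

Ranking all stems by this value and keeping the top 300 yields the reduced feature set described above.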
Figure 2. Plot of IG values.

3.4 Principal Component Analysis

The reduced feature set was used to create a document-term matrix to represent the 360 documents from the Shopping, Health and Education classes. The resulting 360 x 300 matrix represents each web page as a vector of 300 columns. The values in the document-term matrix are binary, representing the simple occurrence or non-occurrence of that feature in that web page. The tf.idf [10] weighting scheme was also investigated, but the results were not significantly different from those using binary weights. Principal Component Analysis (PCA) was applied to this binary matrix to determine if we could distinguish among the three categories using this data.

Principal Components Analysis [13] is a technique for simplifying a dataset. It is a linear transformation of the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA can be used for dimensionality reduction in a dataset, while retaining those characteristics of the dataset that contribute most to its variance, by keeping lower-order principal components and ignoring higher-order ones. Such low-order components often contain the most important aspects of the data. One advantage of PCA, important to this research, is that once the patterns in the data have been found, the number of dimensions may be reduced without much subsequent loss of information. PCA is similar to Latent Semantic Indexing [4] in that the dimensionality is reduced and then the terms and documents can be projected into this reduced space.

In applying PCA, we calculated the covariance matrix of the feature set that had been reduced using Information Gain, and then calculated the eigenvalues and eigenvectors of this covariance matrix. The largest eigenvalue identifies the eigenvector that expresses the most significant relationship among the data dimensions. This eigenvector is the first principal component, which is then chosen as the most significant component. The results of the PCA analysis of our reduced document-term matrix indicated that the first three eigenvectors carry most of the information (Figure 3).

Figure 3. PCA-generated eigenvalues (the first 3 eigenvectors carry most of the information).

Consequently, we used the PCA results to project the 360 x 300 document-term matrix into a 360 x 3 matrix. The resulting 3-dimensional graph (Figure 4) shows that the Shopping, Health and Education web pages do appear to cluster along these three principal components (eigenvectors). The circles represent the Shopping web pages, the stars the Health web pages, and the plus signs represent the Education web pages.

3.5 Decision tree categorization

After the PCA, each web page was represented in the reduced vector space by the coordinates of the three eigenvectors. The c4.5 decision tree package [6] was then used to analyze the resulting projection and to extract a set of rules for the decision tree. This decision tree was then used to classify new web pages into one of the three categories and a NOISE category.
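The covariance-and-eigenvector procedure of Section 3.4 can be sketched in NumPy. This is a minimal illustration, not the authors' implementation; a random binary matrix stands in for the actual 360 x 300 document-term matrix:

```python
import numpy as np

# Stand-in for the 360 x 300 binary document-term matrix
# (random here; the paper's matrix comes from the IG-selected stems).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(360, 300)).astype(float)

# Covariance matrix over the 300 feature dimensions (mean-centre first)
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)                 # 300 x 300

# Eigenvalues/eigenvectors of the symmetric covariance matrix;
# np.linalg.eigh returns eigenvalues in ascending order, so re-sort.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]              # descending by variance
top3 = eigvecs[:, order[:3]]                   # 300 x 3 projection matrix

# Project each web page vector into the 3-dimensional PCA space
projected = Xc @ top3
print(projected.shape)                         # (360, 3)
```

In practice a library PCA routine (or an SVD of the centred matrix) would give the same projection; the explicit eigendecomposition is shown only to mirror the steps described in the text.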
Figure 4. PCA plot of web pages.

3.6 Experimental procedure

The 10-fold cross-validation process was used for the evaluation experiments. This process is often used to give statistical validity to situations where the data sets are small. In this process, the 430 web pages, consisting of 120 Shopping, 120 Health, 120 Education, and 70 not-Shopping-Health-Education (i.e., noise) web pages, were divided randomly into 10 equal partitions. The web pages assigned to each partition reflected the distribution of the categories of web pages over the entire data set. One partition was held out as the test set and the other nine partitions formed the training set. The classifier, in this case the PCA projection matrix, was trained on the training set and tested on the test set. This was repeated for a total of 10 evaluations, each partition being held out in turn as the test set. The results for all of the iterations were then averaged to give the final results. The following steps were followed for each iteration of the 10-fold cross-validation process in our experiment:

- For the training set, Information Gain was calculated for terms in the known Shopping, Health and Education categories.
- Features were selected from the top 300 Information Gain values to create the reduced feature set.
- The document-term matrix was generated for all web pages in the training set, including the noise category of web pages.
- PCA was applied to the matrix and only the eigenvectors associated with the largest three eigenvalues were kept.
- The c4.5 decision tree was run on the projected matrix to determine the rules for building a decision tree, and the tree was generated.
- The test set of data was projected into the PCA eigenvector space.
- The decision tree was applied to the projected test set to categorize each web page of the test set as Shopping, Health, Education or noise.

4. Results and discussion

The results of the 10-fold cross-validation process are presented in the confusion matrix of Table 2. A confusion matrix presents a view of both correct and incorrect classifications. The rows represent the original or correct categories and the columns represent the assigned categories. The distribution of web pages in each of the 10 partitions was 12 web pages of each of Health, Shopping and Education, and 7 web pages of noise. A perfect system would have values only on the diagonal. Each cell of Table 2 shows the average number of web pages of the original category assigned to that target category, with the standard deviation in parentheses. For instance, the average number of health web pages assigned to the health category over the 10 iterations was 10.0, with a standard deviation of 1.15. The average number of health web pages (incorrectly) assigned to the shopping category was 0.8, with a standard deviation of 0.92. A perfect system would have assigned 12 web pages and 0 web pages, respectively, to these two categories.

Table 2. Confusion matrix for test data (rows: original categories; columns: assigned categories)

            Health        Shopping      Education     Noise
Health      10.0 (1.15)   0.8 (0.92)    0.9 (0.57)    0.6 (0.70)
Shopping    0.8 (1.13)    9.9 (1.60)    0.1 (0.32)    1.1 (1.66)
Education   0.5 (0.70)    0.1 (0.32)    9.2 (0.63)    1.6 (1.51)
Noise       0.71 (1.06)   1.2 (1.03)    1.8 (0.79)    3.7 (1.25)

A Chi-Square analysis of this confusion matrix found that the distribution was significant at p = 0.001 (df = 9, χ² = 59.00). These categorization results were evaluated using the recall and precision measures. Recall is the proportion of web pages that should be in a particular category that are correctly assigned to that category. Precision is the proportion of web pages that are assigned to a particular category that should be in that category.
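Recall and precision as defined here can be read directly off the confusion matrix. A short sketch using the averaged cell values from Table 2 (since the cells are 10-fold averages, the derived percentages are approximate):

```python
import numpy as np

# Averaged confusion matrix from Table 2
# (rows = original category, columns = assigned category)
categories = ["Health", "Shopping", "Education", "Noise"]
cm = np.array([
    [10.0, 0.8, 0.9, 0.6],
    [0.8, 9.9, 0.1, 1.1],
    [0.5, 0.1, 9.2, 1.6],
    [0.71, 1.2, 1.8, 3.7],
])

for i, name in enumerate(categories):
    recall = cm[i, i] / cm[i, :].sum()     # correct / all truly in the class
    precision = cm[i, i] / cm[:, i].sum()  # correct / all assigned to the class
    print(f"{name}: recall={recall:.2f}, precision={precision:.2f}")
```

Run on these averages, the three target classes come out at roughly 80% recall and precision and the noise class at roughly 50%, consistent with the figures reported in the text.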
The recall and precision results are shown in Table 3.

Table 3. Recall and precision values by category (Shopping, Health, Education, Noise).

The overall results for web pages assigned to the Shopping, Health and Education classes are approximately 80% for both recall and precision. Note that recall and precision for the noise class are approximately 50%, indicating that the classifier was only partially able to separate the noise web pages, drawn from classes on which it had not been trained, from the three target classes. Of particular interest is the analysis of the 3-dimensional plot of the projected data including the noise category (Figure 5). The area around the origin in Figure 5, as indicated by the circle, contains the majority of the noise web pages plus those web pages that were weakly classified. These were web pages that had few, if any, of the 300 features representing the original data set (for example, the web page shown in Figure 6). Those web pages represented by positions that lie on a principal component axis and distant from the origin, such as the point labeled Strong Health (for example, the web page shown in Figure 7), were those web pages that contained a large number of features from a single category and very few from any other category. Those web pages represented by positions that appear to be distant from the origin but do not lie on a principal component axis tended to be those web pages that could be classified into more than one class. These web pages contained features that represent more than a single category. An example in Figure 5 is the point labeled Health with Shopping (for example, the web page shown in Figure 8). The web page shown in Figure 8 was classed by YAHOO and the three raters as Health but was in fact about health products for sale, and thus contained a number of features that represented the Shopping category.
Figure 5. PCA plot with noise data.

Figure 6. Web page weakly classified as health.
Figure 7. Web page strongly classified as health.

Figure 8. Web page classified as health but with some shopping characteristics.
5. Summary and future research

In summary, this investigation into automatic web page categorization using Principal Component Analysis has shown promising results. Both recall and precision were slightly over 80%. However, there are a number of limitations and areas that will require further research. In particular, these include the issues of scale, feature set selection, classification of web pages that may belong to two or more categories, and the recognition of new classes. On the issue of scale, the web is huge and we will have to increase both the number of target categories and the number of web pages classified. Although our results are statistically valid, they cannot be used to infer that this approach will produce similar results for much larger data sets or for all categories. The feature set selection is also important. In this initial research we used only the content words and ignored other features, such as HTML tags and links, which have given other researchers good results. Since authorship of web pages is out of the control of search engines and users, it is not surprising that many web pages could reasonably belong to two or more classes or categories. Although from Figure 5 it would appear that the three categories are well separated (and thus our good results), there was some amount of overlap among the sets. This overlap is exemplified by such web pages as those about health but with some shopping characteristics, and education pages that led to health education programs, etc. Refined decision tree analysis may be able to recognize when this occurs, correctly identify the categories involved, and assign web pages to multiple categories. Our initial results indicate that we are able to recognize web pages that do not belong to any of the target classes with an accuracy of only about 50%.
This may happen when noise pages are somewhat similar to a known class, and may also occur when the features of known classes begin to change as the web continues to evolve. Research is also needed to recognize when a new class has developed, either as a novel class or as a derivative of an existing class. It is expected that increasing the proportion of noise web pages will adversely affect the precision of the classifier. This will be addressed in future research, with particular attention to tuning the decision tree rules. Our approach in this research has been feature set selection for the target categories, principal component analysis to reduce the dimensionality and to determine the principal components and, finally, the development and application of a decision tree to categorize the test web pages once projected into the reduced space. Further research will evaluate this approach as compared to other categorization approaches. In particular, we will evaluate our approach against unsupervised learning techniques such as K-means clustering. Ding and He [5] have shown that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. By selecting features only from the three target categories we have biased our principal components to reflect these target categories. Similarly, this same feature set should bias the content of the clusters formed in unsupervised learning approaches such as K-means clustering. The fact that principal components are continuous solutions accurately reflects the fuzzy nature of some of the web pages, e.g., health pages with some shopping characteristics, and thus the need for the decision tree. Our ultimate goal is to be able to provide sets of web pages with features that reflect the user's task. Such automatic categorization should be able to help the user cope with larger and larger web search query results.

6. References

[1] Chakrabarti, S., Dom, B. and P. Indyk. Enhanced Hypertext Categorization Using Hyperlinks. Proceedings of the ACM SIGMOD International Conference on Management of Data.
[2] Chekuri, C., Goldwasser, M.H., Prabhakar, R. and E. Upfal. Web Search Using Automatic Classification. Proceedings of WWW-96, 6th International Conference on the World Wide Web, 1996.
[3] Choi, B. and X. Peng. Dynamic and Hierarchical Classification of Web Pages. Online Information Review, Volume 28, Number 2, 2004.
[4] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, Vol. 41, No. 6, 1990.
[5] Ding, C. and X. He. K-means Clustering via Principal Component Analysis. Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.
[6] Han, J. and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc.
[7] Kan, M.-Y. and H.O.N. Thi. Fast Webpage Classification Using URL Features. Proceedings of the Conference on Information and Knowledge Management (CIKM '05), Bremen, Germany, November 2005.
[8] The Porter Stemming Algorithm.
[9] Prakash, A., Kranthi, A. and R. Kumar. Web Page Classification Based on Document Structure. 2001, pcds.pdf.
[10] Salton, G. Automatic Text Processing. McGraw-Hill Book Company.
[11] Shen, D., Chen, Z., Zeng, H.-J., Zhang, B., Yang, Q., Ma, W.-Y. and Y. Lu. Web-Page Classification through Summarization. Proceedings of the Special Interest Group in Information Retrieval.
[12] SMART FTP site: ftp://ftp.cs.cornell.edu/pub/smart/
[13] Wikipedia. Principal Component Analysis.
[14] Yang, Y. and J.O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proceedings of ICML-97, 14th International Conference on Machine Learning.
[15] Zheng, Z., Wu, X. and R.K. Srihari. Feature Selection for Text Categorization on Imbalanced Data. ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1 (June 2004).
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationCS145: INTRODUCTION TO DATA MINING
CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data
More informationUnsupervised learning in Vision
Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual
More informationAn Ensemble Approach to Enhance Performance of Webpage Classification
An Ensemble Approach to Enhance Performance of Webpage Classification Roshani Choudhary 1, Jagdish Raikwal 2 1, 2 Dept. of Information Technology 1, 2 Institute of Engineering & Technology 1, 2 DAVV Indore,
More informationA Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)
A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) Eman Abdu eha90@aol.com Graduate Center The City University of New York Douglas Salane dsalane@jjay.cuny.edu Center
More informationClustering Documents in Large Text Corpora
Clustering Documents in Large Text Corpora Bin He Faculty of Computer Science Dalhousie University Halifax, Canada B3H 1W5 bhe@cs.dal.ca http://www.cs.dal.ca/ bhe Yongzheng Zhang Faculty of Computer Science
More informationAnomaly Detection on Data Streams with High Dimensional Data Environment
Anomaly Detection on Data Streams with High Dimensional Data Environment Mr. D. Gokul Prasath 1, Dr. R. Sivaraj, M.E, Ph.D., 2 Department of CSE, Velalar College of Engineering & Technology, Erode 1 Assistant
More informationSprinkled Latent Semantic Indexing for Text Classification with Background Knowledge
Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge Haiqin Yang and Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,
More informationFeature Selection Using Modified-MCA Based Scoring Metric for Classification
2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification
More informationBest First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis
Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction
More informationNews Filtering and Summarization System Architecture for Recognition and Summarization of News Pages
Bonfring International Journal of Data Mining, Vol. 7, No. 2, May 2017 11 News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bamber and Micah Jason Abstract---
More informationClassification Algorithms in Data Mining
August 9th, 2016 Suhas Mallesh Yash Thakkar Ashok Choudhary CIS660 Data Mining and Big Data Processing -Dr. Sunnie S. Chung Classification Algorithms in Data Mining Deciding on the classification algorithms
More informationApplication of Support Vector Machine Algorithm in Spam Filtering
Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationInformation-Theoretic Feature Selection Algorithms for Text Classification
Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute
More informationLinear Discriminant Analysis for 3D Face Recognition System
Linear Discriminant Analysis for 3D Face Recognition System 3.1 Introduction Face recognition and verification have been at the top of the research agenda of the computer vision community in recent times.
More informationANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining
ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, 2014 ISSN 2278 5485 EISSN 2278 5477 discovery Science Comparative Study of Classification Algorithms Using Data Mining Akhila
More informationFeature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani
Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate
More information2. Data Preprocessing
2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459
More informationInternational Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at
Performance Evaluation of Ensemble Method Based Outlier Detection Algorithm Priya. M 1, M. Karthikeyan 2 Department of Computer and Information Science, Annamalai University, Annamalai Nagar, Tamil Nadu,
More informationWeka ( )
Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised
More informationCIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]
CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, 2015. 11:59pm, PDF to Canvas [100 points] Instructions. Please write up your responses to the following problems clearly and concisely.
More informationCS249: ADVANCED DATA MINING
CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal
More informationIn this project, I examined methods to classify a corpus of s by their content in order to suggest text blocks for semi-automatic replies.
December 13, 2006 IS256: Applied Natural Language Processing Final Project Email classification for semi-automated reply generation HANNES HESSE mail 2056 Emerson Street Berkeley, CA 94703 phone 1 (510)
More informationPATTERN RECOGNITION USING NEURAL NETWORKS
PATTERN RECOGNITION USING NEURAL NETWORKS Santaji Ghorpade 1, Jayshree Ghorpade 2 and Shamla Mantri 3 1 Department of Information Technology Engineering, Pune University, India santaji_11jan@yahoo.co.in,
More informationDatabase and Knowledge-Base Systems: Data Mining. Martin Ester
Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro
More informationECE 285 Class Project Report
ECE 285 Class Project Report Based on Source localization in an ocean waveguide using supervised machine learning Yiwen Gong ( yig122@eng.ucsd.edu), Yu Chai( yuc385@eng.ucsd.edu ), Yifeng Bu( ybu@eng.ucsd.edu
More informationAutomated Online News Classification with Personalization
Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798
More informationA Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics
A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au
More informationA New Technique to Optimize User s Browsing Session using Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
More informationClassification. 1 o Semestre 2007/2008
Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems
More informationFeature selection. LING 572 Fei Xia
Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection
More informationA Comparative Study of Selected Classification Algorithms of Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220
More informationSlides for Data Mining by I. H. Witten and E. Frank
Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-
More informationKeyword Extraction by KNN considering Similarity among Features
64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,
More informationEnhancing Cluster Quality by Using User Browsing Time
Enhancing Cluster Quality by Using User Browsing Time Rehab Duwairi Dept. of Computer Information Systems Jordan Univ. of Sc. and Technology Irbid, Jordan rehab@just.edu.jo Khaleifah Al.jada' Dept. of
More informationProbabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation
Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation Daniel Lowd January 14, 2004 1 Introduction Probabilistic models have shown increasing popularity
More informationIndividualized Error Estimation for Classification and Regression Models
Individualized Error Estimation for Classification and Regression Models Krisztian Buza, Alexandros Nanopoulos, Lars Schmidt-Thieme Abstract Estimating the error of classification and regression models
More informationIntrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN
Intrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN Presentation Overview - Background - Preprocessing - Data Mining Methods to Determine Outliers - Finding Outliers - Outlier Validation -Summary
More information7. Mining Text and Web Data
7. Mining Text and Web Data Contents of this Chapter 7.1 Introduction 7.2 Data Preprocessing 7.3 Text and Web Clustering 7.4 Text and Web Classification 7.5 References [Han & Kamber 2006, Sections 10.4
More informationIdentifying Layout Classes for Mathematical Symbols Using Layout Context
Rochester Institute of Technology RIT Scholar Works Articles 2009 Identifying Layout Classes for Mathematical Symbols Using Layout Context Ling Ouyang Rochester Institute of Technology Richard Zanibbi
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationAN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS
AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS H.S Behera Department of Computer Science and Engineering, Veer Surendra Sai University
More informationHeterogeneous Graph-Based Intent Learning with Queries, Web Pages and Wikipedia Concepts
Heterogeneous Graph-Based Intent Learning with Queries, Web Pages and Wikipedia Concepts Xiang Ren, Yujing Wang, Xiao Yu, Jun Yan, Zheng Chen, Jiawei Han University of Illinois, at Urbana Champaign MicrosoD
More informationExploiting Index Pruning Methods for Clustering XML Collections
Exploiting Index Pruning Methods for Clustering XML Collections Ismail Sengor Altingovde, Duygu Atilgan and Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey {ismaila,
More informationGeneral Instructions. Questions
CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These
More informationFast or furious? - User analysis of SF Express Inc
CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood
More informationFacial Expression Recognition using Principal Component Analysis with Singular Value Decomposition
ISSN: 2321-7782 (Online) Volume 1, Issue 6, November 2013 International Journal of Advance Research in Computer Science and Management Studies Research Paper Available online at: www.ijarcsms.com Facial
More informationKEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method.
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY IMPROVED ROUGH FUZZY POSSIBILISTIC C-MEANS (RFPCM) CLUSTERING ALGORITHM FOR MARKET DATA T.Buvana*, Dr.P.krishnakumari *Research
More informationhighest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate
Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California
More informationA Review on Identifying the Main Content From Web Pages
A Review on Identifying the Main Content From Web Pages Madhura R. Kaddu 1, Dr. R. B. Kulkarni 2 1, 2 Department of Computer Scienece and Engineering, Walchand Institute of Technology, Solapur University,
More informationDesigning and Building an Automatic Information Retrieval System for Handling the Arabic Data
American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far
More informationAccelerometer Gesture Recognition
Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate
More informationCHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES
70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically
More informationDimension Reduction CS534
Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of
More informationEnhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques
24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE
More informationInfluence of Word Normalization on Text Classification
Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we
More informationDomain-specific Concept-based Information Retrieval System
Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical
More informationDimension reduction : PCA and Clustering
Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental
More informationEnhancing Cluster Quality by Using User Browsing Time
Enhancing Cluster Quality by Using User Browsing Time Rehab M. Duwairi* and Khaleifah Al.jada'** * Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110,
More information2. Design Methodology
Content-aware Email Multiclass Classification Categorize Emails According to Senders Liwei Wang, Li Du s Abstract People nowadays are overwhelmed by tons of coming emails everyday at work or in their daily
More informationCOMP 465: Data Mining Classification Basics
Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised
More informationSCUBA DIVER: SUBSPACE CLUSTERING OF WEB SEARCH RESULTS
SCUBA DIVER: SUBSPACE CLUSTERING OF WEB SEARCH RESULTS Fatih Gelgi, Srinivas Vadrevu, Hasan Davulcu Department of Computer Science and Engineering, Arizona State University, Tempe, AZ fagelgi@asu.edu,
More informationSNS College of Technology, Coimbatore, India
Support Vector Machine: An efficient classifier for Method Level Bug Prediction using Information Gain 1 M.Vaijayanthi and 2 M. Nithya, 1,2 Assistant Professor, Department of Computer Science and Engineering,
More informationResearch on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a
International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,
More informationGeneric Face Alignment Using an Improved Active Shape Model
Generic Face Alignment Using an Improved Active Shape Model Liting Wang, Xiaoqing Ding, Chi Fang Electronic Engineering Department, Tsinghua University, Beijing, China {wanglt, dxq, fangchi} @ocrserv.ee.tsinghua.edu.cn
More informationMODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS
MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500
More informationData Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining
Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Data Preprocessing Aggregation Sampling Dimensionality Reduction Feature subset selection Feature creation
More informationPredicting Popular Xbox games based on Search Queries of Users
1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which
More informationMSA220 - Statistical Learning for Big Data
MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups
More informationTwo-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California
Two-Dimensional Visualization for Internet Resource Discovery Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California 90089-0781 fshli, danzigg@cs.usc.edu
More informationLecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic
SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association
More information