Automatic Web Page Categorization using Principal Component Analysis

Richong Zhang, Michael Shepherd, Jack Duffy, Carolyn Watters
Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada B3H 3L7

Abstract

Today's search engines retrieve tens of thousands of web pages in response to fairly simple query articulations. These pages are retrieved on the basis of the query terms occurring in the web pages and the popularity of the web pages as per the link structure of the web. However, these search engines do not take into account the broader information need of the user, such as the task in which the user is involved. This research investigates the automatic categorization of web pages using Principal Component Analysis. The research focuses on user tasks that involve searching for web pages containing health information, education information or shopping information. Initial results are encouraging, with recall and precision values slightly in excess of 80%.

1. Introduction

Search engines help us to retrieve web pages that satisfy the specifics of a user's query. The result set, however, is often so large that it is hard to find the page or pages that will actually satisfy the user's information need. One approach to increasing the relevance of the results is to categorize the results in anticipation of the user's need. While this would not necessarily reduce the search engine effort, per se, it would allow the user to cope with the results of the search much faster and more accurately. Although search engines could provide this filtering, the categorization could also be done at the user's site on an as-needed basis.

In this paper we discuss ongoing work to categorize web pages on the basis of automatic interpretation of the author's intent for three categories of web pages related to user task: pages about health information, shopping pages, and pages about education. An approach based on Principal Component Analysis (PCA) has been investigated to develop a method to automatically classify web pages based on task. We use the metric Information Gain for feature selection in order to reduce the dimensionality of the document vectors. PCA is then used to map these document vector dimensions into a smaller dimensional space. Individual web documents are then projected into the new space for the purpose of classification. The results of a series of experiments show that this method is able to classify web pages efficiently for these three task categories.

Section 2 of this paper briefly reviews related research, Section 3 describes the methodology of our research, Section 4 discusses the experimental results, and Section 5 summarizes the paper and discusses future research.

2. Related research

The need for methods of classifying web pages has been recognized for some time as a way to reduce the difficulty users have in coping with very large search engine results. Prakash et al. [9] introduced a method to classify web pages based on document structure for university web pages. They proposed a method for the automatic classification of web pages into a few broad categories, including information pages, research pages, and personal home pages. Their method was based on the text content, images, links, videos and other structures of the web document. After testing about 4000 web pages from universities and other domains, 87.83% of the web pages were correctly categorized.
Kan and Thi [7] also used a set of university web pages to show that they could be successfully classified using full text plus the uniform resource locator (URL). They took a subset of the WebKB corpus as the data set and the web pages were classified into student, faculty, course and project pages using a Support Vector Machine (SVM) based on maximum entropy to define the feature set.

Previous work has also addressed the issue of categorizing general web pages. Chekuri et al. [2] automatically classified web pages into pre-specified YAHOO categories. They randomly selected 2000 web pages from 20 YAHOO categories and, after training the automatic classifier, they tested 500 new web pages from the same 20 YAHOO categories. They calculated the probability of a document being assigned to each category and ranked the pages by their probability. The result was that more than 50% of the test web pages were classified into the correct YAHOO category.

Chakrabarti et al. [1] used text and hyperlink features together to build a web page classifier and found this method could significantly improve the accuracy. In addition to the content of the document, they included the classification of the neighbors in the evaluation of the class of each document. The inclusion of the close neighbors of the test document significantly boosted their classification accuracy and they reported a 70% reduction in classification error compared to text-only classification.

Shen and Chen [11] compared a web page summarization method with the traditional text classification method. They used a Naïve Bayes classifier and a Support Vector Machine for the baseline classifications using the text content of the web pages. Their data set included 153,019 pages, distributed over 64 categories from the top two levels of the LookSmart Website. Their summarization method was based on Latent Semantic Analysis (LSA) using terms from the content of the web pages. Their results indicated that classifying the web pages based on the summaries produced by human editors was significantly better (a 13.2% improvement on the micro-F1 measure) than using only the text of the web pages. Experimental results also showed that their automatic summary process could achieve a similar improvement for classification (about 12.9% improvement) [11] over text alone.

Another recent approach is to compare web pages against all possible categories and place pages in the class with the highest probability. Peng and Choi [3] used class hierarchies to improve accuracy by about 6 percent over similar systems.

In the research reported in this paper, our goal is to categorize web pages quickly into a small number of predefined categories, where the categories are user and user-task dependent. For example, when the user is shopping we are not interested in identifying or discriminating among the other possible categories; we are only interested in quickly identifying shopping pages. Therefore, the system should quickly indicate whether or not a page is a member of the shopping category.

3. Methodology

The methodology followed in this research consisted of selecting a random set of web pages from selected YAHOO categories to form a data set, cleaning this data set, determining a set of features to represent the data set, building a document-term matrix, applying Principal Component Analysis to weight the features, categorizing the web pages in the test set, and evaluating the resulting categorization.

3.1 Dataset

The target classes were chosen, arbitrarily, to be Shopping, Health, and Education. As we wanted to generate a dataset consisting of web pages in these three categories, we looked to the YAHOO categories. YAHOO manually classifies web pages into a set of predefined categories (Figure 1). Therefore, we could randomly select a set of web pages, each of which had a known class. The final data set of 430 web pages included 120 web pages, selected randomly, from each of the YAHOO categories of Business & Economy > Shopping, Health, and Education. We also selected 70 web pages to represent noise, i.e., web pages that do not belong to the Shopping, Health or Education categories. These noise pages were selected randomly from the YAHOO categories of Auto Magazine, Calendar, Events, Young Adult, Art History, Election, Games, Sports News and Media, Weather, and Animals.

All 430 web pages were examined by three raters to determine if there was agreement on the YAHOO-assigned categories. The noise web pages were also examined to confirm that they were, in fact, not Shopping, Health or Education pages. Web page selection continued until all three raters agreed on all 430 web pages.

Figure 1. YAHOO categories.

3.2 Data cleaning

The categorization approach in this research was based solely on content, i.e., key words. Therefore, it was necessary to remove all HTML tags, images, etc., from all of the web pages. All remaining words were converted to lowercase, stop words [12] were removed and the remaining words were stemmed with Porter's algorithm [8]. This resulted in 10,985 unique word stems.

3.3 Feature selection

Feature selection, widely used in pattern recognition and data mining, selects a set of features based on some criteria such that the resulting smaller set has a high representational capability. Feature selection reduces the number of features, in this case keywords, needed for processing, so that processing time is reduced. In our case the initial feature set consisted of all 10,985 unique word stems. Information Gain (IG) [14, 15], an information-theoretic measure, was used to rank the features so that a threshold could be established above which features were selected for the reduced feature set. The IG measure is based on the entropy associated with a feature (word stem) with respect to its ability to correctly predict the category in which a given document occurs. It is given by:

IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})

where P(c_i) is the probability of a document of class i occurring, P(c_i \mid t) is the probability of a document of class i occurring given that it contains term t, P(c_i \mid \bar{t}) is the probability of a document of class i occurring given that the document does not contain term t, and P(t) and P(\bar{t}) are the probabilities that a document does or does not contain term t.

The IG was calculated for each term in the term set derived from the web pages of the Shopping, Health and Education categories after the data cleaning process. Table 1 shows the top 20 features (word stems) as determined by the IG values (shown truncated to 3 decimal places). Also shown are the number of web pages in each of the three categories in which each feature occurs.

Table 1. Top 20 features by IG value: educ, diseas, medic, health, teacher, school, price, item, ship, student, custom, accessori, cancer, doctor, public, shop, heart, cart, medicin, physician.

Once each of the 10,985 features was assigned an IG value, it was possible to select the best or most discriminating features. Figure 2 shows a plot of the IG values in descending order. The Y-axis represents the IG value and the X-axis the rank of the features. As can be seen, there is a rapid decrease in the initial set of IG values before the curve starts to flatten out, representing much smaller differences among those features with respect to their ability to differentiate among the classes. After some experimentation, the threshold was set so that the features with the top 300 IG values formed the final feature set.

Figure 2. Plot of IG values.
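As an illustration of the cleaning and feature-selection steps just described, the sketch below (Python) computes IG over binary term occurrence and keeps the top 300 stems. It is a minimal sketch, not the authors' code: the function names are hypothetical, the tiny stop-word set stands in for the SMART stop list [12], and NLTK's PorterStemmer is assumed as an implementation of Porter's algorithm [8].

```python
import math
import re

from nltk.stem import PorterStemmer  # assumed implementation of Porter's algorithm [8]

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}  # stand-in for the SMART stop list [12]
stemmer = PorterStemmer()

def clean(html: str) -> set:
    """Strip tags, lowercase, drop stop words and stem; return the set of word stems."""
    text = re.sub(r"<[^>]+>", " ", html).lower()   # crude tag removal, for illustration only
    words = re.findall(r"[a-z]+", text)
    return {stemmer.stem(w) for w in words if w not in STOP_WORDS}

def information_gain(term: str, docs: list, labels: list) -> float:
    """IG(t) as defined above; docs are sets of stems, labels are category names."""
    n = len(docs)
    classes = set(labels)
    p_t = sum(1 for d in docs if term in d) / n
    # -sum_i P(c_i) log P(c_i)
    ig = -sum((labels.count(c) / n) * math.log(labels.count(c) / n) for c in classes)
    # + P(t) sum_i P(c_i|t) log P(c_i|t), and the complementary term for t-bar
    for present, weight in ((True, p_t), (False, 1.0 - p_t)):
        subset = [lab for d, lab in zip(docs, labels) if (term in d) == present]
        if not subset:
            continue
        for c in classes:
            p = subset.count(c) / len(subset)
            if p > 0:
                ig += weight * p * math.log(p)
    return ig

def select_features(docs, labels, k=300):
    """Rank every stem by IG and keep the top k (the paper's threshold was 300)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: information_gain(t, docs, labels), reverse=True)[:k]
```

Scoring the 10,985 stems from the 360 labelled pages this way and retaining the 300 highest-ranked stems corresponds to the thresholding described above (the direct loop is slow but keeps the formula visible).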

3.4 Principal Component Analysis

The reduced feature set was used to create a document-term matrix to represent the 360 documents from the Shopping, Health and Education classes. The resulting 360 by 300 matrix represents each web page as a vector of 300 columns. The values in the document-term matrix are binary, representing the simple occurrence or non-occurrence of that feature in that web page. The tf.idf [10] weighting scheme was also investigated but the results were not significantly different from those using binary weights. Principal Component Analysis (PCA) was applied to this binary matrix to determine if we could distinguish among the three categories using this data.

Principal Component Analysis [13] is a technique for simplifying a dataset. It is a linear transformation of the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA can be used for dimensionality reduction in a dataset while retaining those characteristics of the dataset that contribute most to its variance, by keeping lower-order principal components and ignoring higher-order ones. Such low-order components often contain the most important aspects of the data. One advantage of PCA, important to this research, is that once the patterns in the data have been found, the number of dimensions may be reduced without much subsequent loss of information. PCA is similar to Latent Semantic Indexing [4] in that the dimensionality is reduced and then the terms and documents can be projected into this reduced space.

In applying PCA, we calculated the covariance matrix of the feature set that had been reduced using Information Gain, and then calculated the eigenvalues and eigenvectors of this covariance matrix. The largest eigenvalue identifies the eigenvector that expresses the most significant relationship among the data dimensions. This eigenvector is the first principal component, which is then chosen as the most significant component. The results of the PCA analysis of our reduced document-term matrix indicated that the first three eigenvectors carry most of the information (Figure 3).

Figure 3. PCA-generated eigenvalues; the first three eigenvectors carry most of the information.

Consequently, we used the PCA results to project the 360 x 300 document-term matrix into a 360 x 3 matrix. The resulting 3-dimensional graph (Figure 4) shows that the Shopping, Health and Education web pages do appear to cluster along these three principal components (eigenvectors). The circles represent the Shopping web pages, the stars the Health web pages and the plus signs represent the Education web pages.

3.5 Decision tree categorization

After the PCA, each web page was represented in the reduced vector space by the co-ordinates of the three eigenvectors. The c4.5 decision tree package [6] was then used to analyze the resulting projection and to extract a set of rules for the decision tree. This decision tree was then used to classify new web pages into one of the three categories and a NOISE category.
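To make the projection and decision-tree steps concrete, here is a minimal sketch (Python/NumPy) of the process described above. The explicit covariance and eigendecomposition mirror Section 3.4; scikit-learn's DecisionTreeClassifier (a CART-style tree) is used only as a stand-in for the c4.5 package the authors used, and the random matrices are placeholder data, not the paper's dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in for the c4.5 package [6]

def pca_projection(X: np.ndarray, n_components: int = 3):
    """Return the mean and the top principal components of a (documents x features) matrix."""
    X = X.astype(float)
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)              # covariance matrix of the reduced feature set
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues/eigenvectors of a symmetric matrix
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the eigenvectors with the largest eigenvalues
    return mean, eigvecs[:, order]                    # components are 300 x 3 in the paper's setting

# Placeholder stand-ins for the 360 x 300 binary training matrix and its labels.
rng = np.random.default_rng(0)
X_train = (rng.random((360, 300)) < 0.1).astype(float)
y_train = rng.choice(["health", "shopping", "education"], size=360)

mean, components = pca_projection(X_train, n_components=3)
Z_train = (X_train - mean) @ components               # 360 x 3 projected co-ordinates

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(Z_train, y_train)

# A new page is cleaned, mapped onto the same 300 features, projected, then classified.
x_new = (rng.random((1, 300)) < 0.1).astype(float)
print(tree.predict((x_new - mean) @ components))
```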

Figure 4. PCA plot of web pages.

3.6 Experimental procedure

The 10-fold cross-validation process was used for the evaluation experiments. This process is often used to give statistical validity to situations where the data sets are small. In this process, the 430 web pages, consisting of 120 Shopping, 120 Health, 120 Education and 70 Not-Shopping-Health-Education (i.e., noise) web pages, were divided randomly into 10 equal partitions. The web pages assigned to each partition represented the distribution of the categories of web pages over the entire data set. One partition was held out as the test set and the other nine partitions formed the training set. The classifier, in this case the PCA projection matrix, was trained on the training set and tested on the held-out test set. This was repeated for a total of 10 evaluations, each partition being held out in turn as the test set. The results for all of the iterations were then averaged to give the final results.

The following steps were followed for each iteration of the 10-fold cross-validation process in our experiment:

- For the training set, Information Gain was calculated for terms in the known Shopping, Health and Education categories.
- Features with the top 300 Information Gain values were selected to create the reduced feature set.
- The document-term matrix was generated for all web pages in the training set, including the noise category of web pages.
- PCA was applied to the matrix and only the eigenvectors associated with the largest three eigenvalues were kept.
- The c4.5 decision tree was run on the projected matrix to determine the rules for building a decision tree and the tree was generated.
- The test set of data was projected into the PCA eigenvector space.
- The decision tree was applied to the projected test set to categorize each web page of the test set as Shopping, Health, Education or Noise.
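A compact sketch of this cross-validation loop is shown below (Python). It assumes the select_features and pca_projection helpers from the earlier sketches are in scope, and uses scikit-learn's StratifiedKFold so that each partition reflects the overall category distribution, as described above; all names here are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def binary_matrix(docs, features):
    """Binary document-term matrix over the selected feature stems."""
    index = {f: j for j, f in enumerate(features)}
    X = np.zeros((len(docs), len(features)))
    for i, d in enumerate(docs):
        for stem in d:
            if stem in index:
                X[i, index[stem]] = 1.0
    return X

def cross_validate(docs, labels, n_splits=10):
    """10-fold cross-validation mirroring the steps listed above."""
    labels = np.array(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in skf.split(np.zeros((len(labels), 1)), labels):
        train_docs = [docs[i] for i in train_idx]
        test_docs = [docs[i] for i in test_idx]
        y_train, y_test = labels[train_idx], labels[test_idx]

        # IG is computed over the three known categories only (noise pages excluded).
        known = [i for i, y in enumerate(y_train) if y != "noise"]
        features = select_features([train_docs[i] for i in known],
                                   [y_train[i] for i in known], k=300)

        X_train = binary_matrix(train_docs, features)          # includes the noise pages
        mean, components = pca_projection(X_train, n_components=3)
        tree = DecisionTreeClassifier(random_state=0).fit((X_train - mean) @ components, y_train)

        X_test = binary_matrix(test_docs, features)
        y_pred = tree.predict((X_test - mean) @ components)
        accuracies.append(float(np.mean(y_pred == y_test)))
    return accuracies
```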

4. Results and discussion

The results of the 10-fold cross-validation process are presented in the confusion matrix of Table 2. A confusion matrix presents a view of both correct and incorrect classifications. The rows represent the original or correct categories and the columns represent the assigned categories. The distribution of web pages in each of the 10 partitions was 12 web pages of each of Health, Shopping and Education and 7 web pages of noise. A perfect system would have values only on the diagonal. Each cell of Table 2 shows the average number of web pages of the original category assigned to that target category, with the standard deviation in parentheses. For instance, the average number of health web pages assigned to the health category over the 10 iterations was 10.0 with a standard deviation of 1.15. The average number of health web pages (incorrectly) assigned to the shopping category was 0.8 with a standard deviation of 0.92. A perfect system would have assigned 12 web pages and 0 web pages, respectively, to these two categories.

Table 2. Confusion matrix for test data (rows: original categories; columns: assigned categories)

Original      Health        Shopping      Education     Noise
Health        10.0 (1.15)   0.8 (0.92)    0.9 (0.57)    0.6 (0.70)
Shopping      0.8 (1.13)    9.9 (1.60)    0.1 (0.32)    1.1 (1.66)
Education     0.5 (0.70)    0.1 (0.32)    9.2 (0.63)    1.6 (1.51)
Noise         0.71 (1.06)   1.2 (1.03)    1.8 (0.79)    3.7 (1.25)

A Chi-Square analysis of this confusion matrix found that the distribution was significant at p = 0.001 (df = 9, χ² = 59.00).

The categorization results were also evaluated using the recall and precision measures. Recall is the proportion of web pages that should be in a particular category that are correctly assigned to that category. Precision is the proportion of web pages that are assigned to a particular category that should be in that category. The recall and precision results are shown in Table 3.

Table 3. Recall and precision values for the Shopping, Health, Education and Noise categories.

The overall results for web pages assigned to the Shopping, Health and Education classes are approximately 80% for both recall and precision. Note that recall and precision for the noise class are approximately 50%, indicating that the classifier was not able to reliably separate the noise web pages, which came from classes on which it had not been trained, from the three target classes.

Of particular interest is the analysis of the 3-dimensional plot of the projected data including the noise category (Figure 5). The area around the origin in Figure 5, as indicated by the circle, contains the majority of the noise web pages plus those web pages that were weakly classified. These were web pages that had few, if any, of the 300 features representing the original data set (for example, the web page shown in Figure 6). Those web pages represented by positions that lie on a principal component axis and distant from the origin, such as the point labeled Strong Health (for example, the web page shown in Figure 7), were those web pages that contained a large number of features from a single category and very few from any other category. Those web pages represented by positions that appear to be distant from the origin but do not lie on a principal component axis tended to be those web pages that could be classified into more than one class. These web pages contained features that represent more than a single category.
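Per-category recall and precision follow directly from the confusion matrix; the short sketch below (Python) computes them from the averaged values in Table 2 (rows are original categories, columns are assigned categories) and reproduces the roughly 80% figures for the three target classes and roughly 50% for noise.

```python
import numpy as np

# Averaged confusion matrix from Table 2, in the order Health, Shopping, Education, Noise.
categories = ["Health", "Shopping", "Education", "Noise"]
cm = np.array([[10.00, 0.8, 0.9, 0.6],
               [ 0.80, 9.9, 0.1, 1.1],
               [ 0.50, 0.1, 9.2, 1.6],
               [ 0.71, 1.2, 1.8, 3.7]])

for i, cat in enumerate(categories):
    recall = cm[i, i] / cm[i, :].sum()     # correct / all pages truly in the category
    precision = cm[i, i] / cm[:, i].sum()  # correct / all pages assigned to the category
    print(f"{cat}: recall = {recall:.2f}, precision = {precision:.2f}")
```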
An example in Figure 5 is the point labeled Health with Shopping (for example, the web page shown in Figure 8). The web page shown in Figure 8 was classed by YAHOO and the three raters as Health but was in fact about health products for sale and thus contained a number of features that represented the Shopping category.

Figure 5. PCA plot with noise data; the points labeled "Health with Shopping" and "Strong Health" are discussed in the text.

Figure 6. A web page weakly classified as Health.

Figure 7. A web page strongly classified as Health.

Figure 8. A web page classified as Health but with some Shopping characteristics.

5. Summary and future research

In summary, this investigation into automatic web page categorization using Principal Component Analysis has shown promising results. Both recall and precision were slightly over 80%. However, there are a number of limitations and areas that will require further research. In particular, these include the issues of scale, feature set selection, classification of web pages that may belong to two or more categories, and the recognition of new classes.

On the issue of scale, the web is huge and we will have to increase both the number of target categories and the number of web pages classified. Although our results are statistically valid, they cannot be used to infer that this approach will produce similar results for much larger data sets or for all categories.

The feature set selection is also important. In this initial research we used only the content words and ignored other features such as HTML tags and links, which have given other researchers good results.

Since authorship of web pages is out of the control of search engines and users, it is not surprising that many web pages could reasonably belong to two or more classes or categories. Although from Figure 5 it would appear that the three categories are well separated (and thus our good results), there was some amount of overlap among the sets. This overlap is exemplified by such web pages as those about health but with some shopping characteristics, and education pages that led to health education programs, etc. Refined decision tree analysis may be able to recognize when this occurs, correctly identify the categories involved and assign web pages to multiple categories.

Our initial results indicate that we are able to recognize web pages that do not belong to any of the target classes with an accuracy of about 50%. This may happen when noise pages are somewhat similar to a known class and may also occur when the features of known classes begin to change as the web continues to evolve. Research is also needed to recognize when a new class has developed, either as a novel class or as a derivative of an existing class. It is expected that increasing the proportion of noise web pages will adversely affect the precision of the classifier. This will be addressed in future research with particular attention to tuning the decision tree rules.

Our approach in this research has been feature set selection for the target categories, principal component analysis to reduce the dimensionality and to determine the principal components and, finally, the development and application of a decision tree to categorize the test web pages once projected into the reduced space. Further research will evaluate this approach against other categorization approaches. In particular, we will evaluate our approach against unsupervised learning techniques such as K-means clustering. Ding and He [5] have shown that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. By selecting features only from the three target categories we have biased our principal components to reflect these target categories. Similarly, this same feature set should bias the content of the clusters formed in unsupervised learning approaches such as K-means clustering. The fact that principal components are continuous solutions accurately reflects the fuzzy nature of some of the web pages, e.g., health pages with some shopping characteristics, and thus the need for the decision tree.
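For the planned comparison with unsupervised learning, a minimal sketch of K-means clustering in the same PCA-reduced space is shown below (Python/scikit-learn); the variable Z, the placeholder coordinates and the three-cluster setting are illustrative assumptions, and cluster-to-category agreement would then be compared with the decision-tree results.

```python
import numpy as np
from sklearn.cluster import KMeans

# Z stands in for the 3-dimensional PCA projection of the document-term matrix.
rng = np.random.default_rng(1)
Z = rng.normal(size=(360, 3))          # placeholder coordinates for illustration

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)
print(np.bincount(kmeans.labels_))     # cluster sizes, to be compared against the known categories
```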
Our ultimate goal is to be able to provide sets of web pages with features that reflect the user's task. Such automatic categorization should be able to help the user cope with larger and larger Web search query results.

6. References

[1] Chakrabarti, S., Dom, B. and P. Indyk. Enhanced Hypertext Categorization Using Hyperlinks. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1998.

[2] Chekuri, C., Goldwasser, M.H., Prabhakar, R. and E. Upfal. Web Search Using Automatic Classification. Proceedings of WWW-96, 6th International Conference on the World Wide Web, 1996.

[3] Choi, B. and X. Peng. Dynamic and Hierarchical Classification of Web Pages. Online Information Review, Volume 28, Number 2, 2004.

[4] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, Vol. 41, No. 6, 1990.

[5] Ding, C. and X. He. K-means Clustering via Principal Component Analysis. Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.

[6] Han, J. and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc.

[7] Kan, M.-Y. and H.O.N. Thi. Fast Webpage Classification Using URL Features. Proceedings of the Conference on Information and Knowledge Management (CIKM '05), Bremen, Germany, November 2005.

[8] The Porter Stemming Algorithm.

[9] Prakash, A., Kranthi, A. and R. Kumar. Web Page Classification Based on Document Structure. 2001, pcds.pdf.

[10] Salton, G. Automatic Text Processing. McGraw-Hill Book Company.

[11] Shen, D., Chen, Z., Zeng, H.-J., Zhang, B., Yang, Q., Ma, W.-Y. and Y. Lu. Web-Page Classification through Summarization. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2004.

[12] SMART FTP site: ftp://ftp.cs.cornell.edu/pub/smart/

[13] Wikipedia. Principal Component Analysis.

[14] Yang, Y. and J.O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proceedings of ICML-97, 14th International Conference on Machine Learning, 1997.

[15] Zheng, Z., Wu, X. and R.K. Srihari. Feature Selection for Text Categorization on Imbalanced Data. ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1, June 2004.
