Using Google's PageRank Algorithm to Identify Important Attributes of Genes

Golam Morshed Osmani
Ph.D. Student in Software Engineering
Dept. of Computer Science
North Dakota State University
Fargo, ND

Syed (Shawon) M. Rahman, Ph.D.
Assistant Professor
Computer Science & Software Engineering Dept.
University of Wisconsin - Platteville
1 University Plaza, Platteville, WI

Abstract

In our research, we applied the PageRank algorithm to identify important attributes of genes. The Google search engine uses PageRank to assign a numerical weight to each web page and to order search results. We have found PageRank to be a very effective method for selecting significant attributes of high-dimensional data, especially gene expression data. The important genes can then be used to cluster gene expression data, giving a better clustering in less time and space.

Clustering high-dimensional data requires a lot of resources in terms of both time and memory space. Generally, some attributes are more important than others. We use yeast gene expression data from the Stanford MicroArray Database. Four datasets are used, each with approximately 4905 genes and 5 to 14 expression levels. We calculate the correlation matrix between these genes using their expression levels in various experiments. We used Weka (data mining software in Java) to generate and validate a Naïve Bayes classifier. We found that data ranked with the PageRank algorithm produces better classification than the raw data: a classifier built on the raw data classifies attributes at 48-50% accuracy, while a classifier built on PageRanked data classifies attributes at 62-64% accuracy.

1. Introduction

Clustering high-dimensional data requires a lot of resources in terms of both time and memory space. Sometimes it is unrealistic or infeasible to cluster a dataset using the whole attribute set, particularly when the attribute set is very large. Our attribute selection reduces both memory and computation time requirements and makes clustering feasible. At the same time, it tries to improve the result of clustering, because we select the important or influential attributes.

Gene expression data has many attributes (genes) associated with it. To obtain an effective clustering within limited time and memory, attribute selection has to be used. We apply our technique to select important genes in a gene expression dataset. The key idea behind our model is Google's PageRank [1]. We use the concept that if a page is linked to many other highly ranked pages, it has a high probability of being ranked highly as well. Links between genes are denoted by high correlation values between those genes.

In our experiment we use the yeast dataset obtained from the Stanford MicroArray Database [2]. This database contains raw gene expression data. We select four categories and extract the raw datasets. We then filter the datasets and run our algorithm on them. We compare the results of our algorithm with the results obtained without using our process, and we find a significant improvement in selecting important genes.

2. Background

2.1. Attribute Selection

Attribute selection is an important problem space on which many researchers are working. Its main aim is to select a subset of important attributes from a dataset and use this subset for further exploration. This is particularly useful when the dataset has a huge number of attributes and we need to classify or cluster the dataset using those attributes. It is observed that not all attributes have the same importance or influence on the function or on the clustering of the dataset.
We can extract a subset of attributes based on their influence on the dataset. The one criterion we need to ensure is that attribute selection does not throw away attributes that are important or significant. A related requirement is that the selected subset of attributes should produce better results, or at least similar results, compared to using the whole attribute set. Several attribute selection methods and their comparison can be found in [3]. Some approaches focus on supervised classification [4][5]. Other approaches [6][7] include unsupervised classification, where we do not have prior information to

evaluate potential solutions. Boudjeloud and Poulet [8] focus on a genetic-algorithm-based search technique for attribute selection.

2.2. PageRank

The PageRank algorithm was developed by Google founders Larry Page and Sergey Brin. The basic idea behind PageRank is that a page has a high rank if the sum of the ranks of its backlinks is high. This covers both the case when a page has many backlinks and the case when a page has a few highly ranked backlinks [1].

[Figure 1: A page connected with many highly ranked pages is itself to be ranked highly (PageRank principle). Legend: highly ranked page; page to be ranked; edge in the network.]

Morrison and others [9] use GeneRank, a modified version of PageRank, to evaluate microarray experiment results.

2.3. Naïve Bayes Classifier

A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions [10]. A Naïve Bayes classifier can be trained very efficiently in a supervised learning scenario. Although Naïve Bayes uses over-simplified assumptions and a simple design, it often works well in complex real-world problem domains. One advantage of the Naïve Bayes classifier is that it requires a comparatively small amount of data to estimate the parameters necessary for classification. Under certain specific conditions, the Naïve Bayes classifier has been found to perform better than most other classifiers. Zhang [11] used a Naïve Bayes classifier to rank different objects (customers etc.).

3. Algorithm

We used several steps to achieve our results. The steps are the following:

3.1 Dataset collection

3.2 For each dataset
    3.2.1 Build the correlation network
    3.2.2 Generate the bit matrix from the correlation network
    3.2.3 Populate a table denoting the link between each pair of genes identified by the bit at that position
    3.2.4 Apply the PageRank algorithm to that table
    3.2.5 Normalize the ranks
3.3 Merge these datasets into a single dataset with the function column
3.4 Run the Naïve Bayes classifier on this dataset
3.5 Build a single dataset from all the initial datasets with the function column
3.6 Run the Naïve Bayes classifier on this combined dataset
3.7 Compare results from these two classifications

3.1. Dataset Collection

We collected the yeast dataset from the Stanford MicroArray Database [2] using the public login. Four datasets were collected, for the categories Sporulation (7-point sample), Cell-cycle (14-point sample), Transcription (6-point sample) and Stress (5-point sample). Data from the Stanford MicroArray Database (SMD) are in raw format and zipped. For each time point we get an MS Excel file, and from those files the gene expression values are copied to a single Excel file. Thus we get an Excel file with the gene id and the gene expression values for each experiment in the corresponding column. Later we copy all the data to an MS Access database for filtering.

The Sporulation dataset contains 6384 records, whereas Cell-cycle, Transcription, Stress and Function contain 7680, 8832, 8448 and 7990 records respectively. First we remove duplicate entries. Then we filter the datasets and keep only those genes present in all four datasets and also in the function table. After filtering we get 4905 records of common genes in each dataset. These are the datasets we work on.

The dataset collection process requires some manual labor: going through the data to find any anomalies, removing duplicates, and converting from one format to another for further processing. In our case we used MS Excel and MS Access to collect the base dataset.

3.2. Work on Each Dataset

3.2.1. Build the correlation network

We compute the correlation between every two genes in the current dataset.
We build a matrix that denotes the correlation between all possible pairs of genes in the current dataset. This gives a 4905 x 4905 matrix of correlation coefficients between those genes. The matrix is symmetric, so the upper half and the lower half contain the same information.
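A minimal Python sketch of this step is given below. It assumes the Pearson correlation coefficient (the text does not name the coefficient) and uses a small made-up expression table; the function names and the treatment of constant profiles are likewise illustrative assumptions, not a description of the script used in our experiments.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant expression profile is treated as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_matrix(expression):
    """Return the symmetric gene-by-gene correlation matrix.

    `expression` holds one row per gene and one column per experiment point.
    Only the upper half is computed; the lower half is mirrored from it.
    """
    n = len(expression)
    corr = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(expression[i], expression[j])
            corr[i][j] = corr[j][i] = r
    return corr


# Toy data (hypothetical): 4 genes measured at 5 experiment points.
expression = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],  # rises together with gene 0
    [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as gene 0 rises
    [1.0, 3.0, 2.0, 5.0, 4.0],
]
corr = correlation_matrix(expression)
for row in corr:
    print(" ".join(f"{value:6.2f}" for value in row))
```

Because the matrix is symmetric, the sketch computes each pair once, which roughly halves the work for a matrix of the size used here.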

This step is done in MATLAB. We copy the dataset from the MS Access table to a single tab-delimited text file and run a MATLAB script to calculate the correlation matrix.

3.2.2. Generate the bit matrix from correlation network

We create a bit matrix from the correlation matrix by applying a cutoff value. Here we take 0.5 as the cutoff point; other values may also work. A MATLAB script performs this part.

3.2.3. Populate MySQL table

We consider a set bit at a position to be a link between the two corresponding genes. We populate a MySQL table with this link information using a tuple (from, to). The MATLAB script generates a file containing SQL insert statements to populate the table. Later we use the MySQL command prompt to insert the data using that file.

3.2.4. Run the PageRank algorithm

We run the PHP implementation of the PageRank algorithm found at [12] on the table and get the rank of each gene. Later we use a PHP script to read the rank data from the rank table and copy it to an MS Access table.

3.2.5. Normalize the ranks

We normalize the ranks to lie in the [0, 1] interval.

3.3. Merge These Datasets into a Single Dataset with the Function Column

We merge the ranks of all four datasets into a single dataset and add the function column to this newly merged dataset. This part is done in MS Access.

3.4. Run the Naïve Bayes Classifier on This Dataset

We use Weka [13] to build and validate a classifier. For this purpose we divide our data into a train set and a test set. The train dataset is created from the records with comparatively high rank values in all four tables. The only assumption we make is that the train dataset should have an almost equal number of 0s and 1s in the function column, to make the classification less biased. Weka has a special file format, the Attribute-Relation File Format (ARFF). We

converted our data into ARFF format and ran Weka to train a Naïve Bayes classifier using the train dataset and validate it with the test dataset.

3.5 Build a Single Dataset from All Those Initial Datasets with the Function Column

We build a combined raw dataset from all four raw datasets, before computing correlations and ranks. The function column is also added to this dataset.

3.6 Run the Naïve Bayes Classifier on This Combined Dataset

We randomly divide this combined raw dataset into a train dataset and a test dataset. Then we train a Naïve Bayes classifier on the raw train dataset and validate it with the raw test dataset. We make some assumptions here. We take almost the same number of records in this train dataset as in the train dataset of ranks. We also use an almost equal number of 1s and 0s in the train dataset to minimize the bias towards a particular value (in our case, 0).

3.7 Compare Results From These Two Classifications

We compare the results obtained from Weka and the graphs provided by Weka. We found a significant improvement in classification accuracy when we use ranking.

4. Data Statistics

Initial number of records in each raw dataset.
Number of genes common to all these datasets: 4905

Table 1: Number of records in each raw dataset collected from SMD [2] (values as reported in Section 3.1).

Recordset       # of records    # of experiment points
Sporulation     6384            7
Cell-cycle      7680            14
Transcription   8832            6
Stress          8448            5
Function        7990            -

Compared Data Statistics

In Table 2, we compare the statistics of the raw dataset with those obtained using our algorithm.
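The class-balanced random split described in Section 3.6 can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout (gene id, feature list, function label), the `train_per_class` parameter, the fixed random seed and the toy records are all hypothetical, and the sketch draws exactly equal numbers of 0s and 1s, whereas our train datasets are only approximately balanced.

```python
import random


def balanced_split(records, train_per_class, seed=0):
    """Split records into a class-balanced train set and a test set.

    Each record is a tuple (gene_id, features, label) with label 0 or 1.
    Up to `train_per_class` records are drawn at random for each label;
    every record not drawn goes to the test set.
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for record in records:
        by_label[record[2]].append(record)

    train = []
    for label in (0, 1):
        pool = by_label[label]
        train.extend(rng.sample(pool, min(train_per_class, len(pool))))

    train_ids = {record[0] for record in train}
    test = [record for record in records if record[0] not in train_ids]
    return train, test


# Toy records (hypothetical): 20 genes, every third one labelled 1.
records = [
    (f"gene{i}", [float(i), float(i % 5)], 1 if i % 3 == 0 else 0)
    for i in range(20)
]
train, test = balanced_split(records, train_per_class=3)
print("train:", len(train), "records,", sum(r[2] for r in train), "with label 1")
print("test: ", len(test), "records,", sum(r[2] for r in test), "with label 1")
```

Balancing only the train set leaves the test set with the natural class proportions, which matches the situation where most test records carry the function value 0.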

Table 2. Comparison of the statistics between the raw dataset and using our algorithm.

                                          Raw dataset    Using our algorithm
Train dataset
  Total # of records                      121            108
  # of records with function value 1      59             52
  # of records with function value 0      62             56
Test dataset
  Total # of records                      4784           4797
  # of records with function value 1      1490           1497
  # of records with function value 0      3294           3300 (4797 - 1497)

Time statistics:

All statistics are taken on a Pentium Centrino 1.6 GHz with 768 MB RAM.

Time to generate the correlation matrix and create a MySQL bulk insert file = 2 min 25 sec
Time to insert into the MySQL table = 32 min 10 sec
Time to run the PageRank algorithm = 2 hours 10 min

The above statistics are for the Sporulation dataset only. The other datasets take a similar amount of time, except the Cell-cycle dataset, which takes slightly less time in the second and third cases.

5. Results and discussion

We get promising results after using our algorithm. We first run the Naïve Bayes classifier in Weka on the raw dataset (without applying our algorithm). The result is not good: only 48.43% of the data is correctly classified.

[Table: Result of classification using raw data. Columns: Correctly Classified Instances (%), Incorrectly Classified Instances (%), TP Rate, FP Rate, Precision, Recall, F-Measure, Class.]

When we apply our algorithm and run Weka again, the percentage of correctly classified instances rises by about 13 percentage points, a major improvement. Thus we see that our method produces better results. Genes can be classified according to whether or not they have a significant impact on the function, and thereby selected for further clustering.

[Table: Result of classification after applying our algorithm. Columns: Correctly Classified Instances (%), Incorrectly Classified Instances (%), TP Rate, FP Rate, Precision, Recall, F-Measure, Class.]

We make some assumptions which need to be verified. However, our results show that those assumptions do not have any adverse impact. Extensive study and experiments with different assumptions may reveal the best assumptions to make. For example, we took 0.5 as the cutoff point when producing the bit matrix from the correlation network; other values may be tried as well. We also take almost equal numbers of 0s and 1s to build the train dataset. Our observations show that this reduces the chance of bias towards a particular value (in our case, 0).

6. Conclusion

In this paper, we have presented a method for selecting important attributes (genes) from a gene expression dataset. Our method uses Google's PageRank algorithm to rank genes based on their correlation network. The high-ranking genes are then used to build a Naïve Bayes classifier. We compared our approach with the use of raw data and found a significant improvement. Thus we have shown that our method produces better results than the raw data method.

Further research to find an optimal or suitable cutoff value would be a place for improvement. Before computing the correlation matrix, we may also normalize the values. Selecting the top 100 or 200 genes irrespective of the number of 0s and 1s, and building a classifier on that selection, would be another area to explore. Currently, the classifier fails to train itself if 0s and 1s are not balanced (that is, unless their proportion stays close to 1:1).
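As a rough illustration of the ranking step summarized above, the following Python sketch thresholds a small correlation matrix at a cutoff, runs a generic power-iteration PageRank on the resulting (from, to) links, and min-max normalizes the ranks to [0, 1]. It is not the PHP implementation [12] used in our experiments: the damping factor of 0.85, the iteration count, the handling of nodes without links, the exclusion of self-links, the comparison of the signed correlation against the cutoff, and the toy matrix are all assumptions made for illustration.

```python
def links_from_correlation(corr, cutoff=0.5):
    """Return (from, to) tuples for gene pairs whose correlation meets the cutoff.

    Assumptions: self-links are skipped and the signed correlation is compared.
    """
    n = len(corr)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and corr[i][j] >= cutoff]


def pagerank(n, links, damping=0.85, iterations=100):
    """Generic power-iteration PageRank over n nodes and directed links."""
    out_degree = [0] * n
    for src, _ in links:
        out_degree[src] += 1

    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[i] for i in range(n) if out_degree[i] == 0)
        new_rank = [(1.0 - damping) / n + damping * dangling / n] * n
        for src, dst in links:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


def normalize(ranks):
    """Min-max scale ranks into [0, 1]; all-equal ranks map to 0.0 (assumption)."""
    low, high = min(ranks), max(ranks)
    if high == low:
        return [0.0] * len(ranks)
    return [(r - low) / (high - low) for r in ranks]


# Toy symmetric correlation matrix for 4 hypothetical genes.
corr = [
    [1.0, 0.9, 0.7, 0.1],
    [0.9, 1.0, 0.6, 0.2],
    [0.7, 0.6, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
links = links_from_correlation(corr, cutoff=0.5)
ranks = pagerank(len(corr), links)
scores = normalize(ranks)
for gene, score in enumerate(scores):
    print(f"gene {gene}: normalized rank {score:.3f}")
```

In this toy network gene 2 is linked to all three other genes and receives the highest normalized rank, while gene 3 has a single link and receives the lowest. Changing `cutoff` changes which links exist, which is one way to explore the sensitivity to the cutoff value mentioned above.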

References

1. Larry Page, Sergey Brin, R. Motwani, T. Winograd: The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford (Santa Barbara, CA 93106, January 1998).
2. Stanford MicroArray Database (http://genome-www5.stanford.edu/), web retrieved on December 10.
3. Borges, H.B., Nievola, J.C. (2005); Attribute Selection Methods Comparison for Classification of Diffuse Large B-Cell Lymphoma, In Proceedings of the Fourth International Conference on Machine Learning and Applications, December.
4. G. John, R. Kohavi, and K. Pfleger; Irrelevant features and subset selection problem, In Morgan Kaufmann New Brunswick, NJ, editor, the eleventh International Conference on Machine Learning.
5. H. Liu and H. Motoda; Feature selection for knowledge discovery and data mining. In Kluwer International Series in Engineering and Computer Science, Secs.
6. M. Dash and H. Liu; Feature selection for clustering. In Pacific-Asia Conference on Knowledge Discovery and Data Mining.
7. Y. Kim, W. N. Street, and F. Menczer. Evolutionary model selection in unsupervised learning. volume 6, IOS Press.
8. Lydia Boudjeloud and François Poulet; Attribute Selection for High Dimensional Data Clustering, In Proceedings of the Applied Stochastic Models and Data Analysis.
9. Julie L. Morrison, Rainer Breitling, Desmond J. Higham, David R. Gilbert: GeneRank: Using search engine technology for the analysis of microarray experiments. BMC Bioinformatics 6: 233 (2005).
10. Naïve Bayes classifier (http://en.wikipedia.org/wiki/naive_bayes_classifier/), web retrieved on December 10.
11. H. Zhang and J. S; Naive Bayesian classifiers for ranking, Proceedings of the 15th European Conference on Machine Learning (ECML2004), Springer (2004).
12. PageRank implementation in PHP (http://gxrank.googlecode.com/svn/trunk/), web retrieved on December 10.
13. Ian H. Witten and Eibe Frank (2005) "Data Mining: Practical machine learning tools and techniques", 2nd Edition, Morgan Kaufmann, San Francisco.

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov WEKA: Practical Machine Learning Tools and Techniques in Java Seminar A.I. Tools WS 2006/07 Rossen Dimov Overview Basic introduction to Machine Learning Weka Tool Conclusion Document classification Demo

More information

The Role of Biomedical Dataset in Classification

The Role of Biomedical Dataset in Classification The Role of Biomedical Dataset in Classification Ajay Kumar Tanwani and Muddassar Farooq Next Generation Intelligent Networks Research Center (nexgin RC) National University of Computer & Emerging Sciences

More information

Feature Subset Selection Problem using Wrapper Approach in Supervised Learning

Feature Subset Selection Problem using Wrapper Approach in Supervised Learning Feature Subset Selection Problem using Wrapper Approach in Supervised Learning Asha Gowda Karegowda Dept. of Master of Computer Applications Technology Tumkur, Karnataka,India M.A.Jayaram Dept. of Master

More information

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Study on Classifiers using Genetic Algorithm and Class based Rules Generation 2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules

More information

Feature Selection Algorithm with Discretization and PSO Search Methods for Continuous Attributes

Feature Selection Algorithm with Discretization and PSO Search Methods for Continuous Attributes Feature Selection Algorithm with Discretization and PSO Search Methods for Continuous Attributes Madhu.G 1, Rajinikanth.T.V 2, Govardhan.A 3 1 Dept of Information Technology, VNRVJIET, Hyderabad-90, INDIA,

More information

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING Daniela Joiţa Titu Maiorescu University, Bucharest, Romania danielajoita@utmro Abstract Discretization of real-valued data is often used as a pre-processing

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

Feature Ranking in Intrusion Detection Dataset using Combination of Filtering Methods

Feature Ranking in Intrusion Detection Dataset using Combination of Filtering Methods Feature Ranking in Intrusion Detection Dataset using Combination of Filtering Methods Zahra Karimi Islamic Azad University Tehran North Branch Dept. of Computer Engineering Tehran, Iran Mohammad Mansour

More information

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai CS 8803 AIAD Prof Ling Liu Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai Under the supervision of Steve Webb Motivations and Objectives Spam, which was until

More information

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices Syracuse University SURFACE School of Information Studies: Faculty Scholarship School of Information Studies (ischool) 12-2002 Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

More information

COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES

COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES USING DIFFERENT DATASETS V. Vaithiyanathan 1, K. Rajeswari 2, Kapil Tajane 3, Rahul Pitale 3 1 Associate Dean Research, CTS Chair Professor, SASTRA University,

More information

Contents. ACE Presentation. Comparison with existing frameworks. Technical aspects. ACE 2.0 and future work. 24 October 2009 ACE 2

Contents. ACE Presentation. Comparison with existing frameworks. Technical aspects. ACE 2.0 and future work. 24 October 2009 ACE 2 ACE Contents ACE Presentation Comparison with existing frameworks Technical aspects ACE 2.0 and future work 24 October 2009 ACE 2 ACE Presentation 24 October 2009 ACE 3 ACE Presentation Framework for using

More information

CloNI: clustering of JN -interval discretization

CloNI: clustering of JN -interval discretization CloNI: clustering of JN -interval discretization C. Ratanamahatana Department of Computer Science, University of California, Riverside, USA Abstract It is known that the naive Bayesian classifier typically

More information

A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

More information

Using Decision Boundary to Analyze Classifiers

Using Decision Boundary to Analyze Classifiers Using Decision Boundary to Analyze Classifiers Zhiyong Yan Congfu Xu College of Computer Science, Zhejiang University, Hangzhou, China yanzhiyong@zju.edu.cn Abstract In this paper we propose to use decision

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART V Credibility: Evaluating what s been learned 10/25/2000 2 Evaluation: the key to success How

More information

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC Attribute Discretization and Selection Clustering NIKOLA MILIKIĆ nikola.milikic@fon.bg.ac.rs UROŠ KRČADINAC uros@krcadinac.com Naive Bayes Features Intended primarily for the work with nominal attributes

More information

Big Data Analytics for Host Misbehavior Detection

Big Data Analytics for Host Misbehavior Detection Big Data Analytics for Host Misbehavior Detection Miguel Pupo Correia joint work with Daniel Gonçalves, João Bota (Vodafone PT) 2016 European Security Conference June 2016 Motivation Networks are complex,

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

Redundancy Based Feature Selection for Microarray Data

Redundancy Based Feature Selection for Microarray Data Redundancy Based Feature Selection for Microarray Data Lei Yu Department of Computer Science & Engineering Arizona State University Tempe, AZ 85287-8809 leiyu@asu.edu Huan Liu Department of Computer Science

More information

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should

More information

NETWORK FAULT DETECTION - A CASE FOR DATA MINING

NETWORK FAULT DETECTION - A CASE FOR DATA MINING NETWORK FAULT DETECTION - A CASE FOR DATA MINING Poonam Chaudhary & Vikram Singh Department of Computer Science Ch. Devi Lal University, Sirsa ABSTRACT: Parts of the general network fault management problem,

More information

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2 161 Machine Learning Hierarchical clustering Reading: Bishop: 9-9.2 Second half: Overview Clustering - Hierarchical, semi-supervised learning Graphical models - Bayesian networks, HMMs, Reasoning under

More information

Data Preprocessing. Data Preprocessing

Data Preprocessing. Data Preprocessing Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should

More information

Enhancing Cluster Quality by Using User Browsing Time

Enhancing Cluster Quality by Using User Browsing Time Enhancing Cluster Quality by Using User Browsing Time Rehab Duwairi Dept. of Computer Information Systems Jordan Univ. of Sc. and Technology Irbid, Jordan rehab@just.edu.jo Khaleifah Al.jada' Dept. of

More information

Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm

Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm Rekha Jain 1, Sulochana Nathawat 2, Dr. G.N. Purohit 3 1 Department of Computer Science, Banasthali University, Jaipur, Rajasthan ABSTRACT

More information

A New Technique for Ranking Web Pages and Adwords

A New Technique for Ranking Web Pages and Adwords A New Technique for Ranking Web Pages and Adwords K. P. Shyam Sharath Jagannathan Maheswari Rajavel, Ph.D ABSTRACT Web mining is an active research area which mainly deals with the application on data

More information

Feature Selection for Supervised Classification: A Kolmogorov- Smirnov Class Correlation-Based Filter

Feature Selection for Supervised Classification: A Kolmogorov- Smirnov Class Correlation-Based Filter Feature Selection for Supervised Classification: A Kolmogorov- Smirnov Class Correlation-Based Filter Marcin Blachnik 1), Włodzisław Duch 2), Adam Kachel 1), Jacek Biesiada 1,3) 1) Silesian University

More information

Accelerometer Gesture Recognition

Accelerometer Gesture Recognition Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

More information

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine

More information

Mathematical Methods and Computational Algorithms for Complex Networks. Benard Abola

Mathematical Methods and Computational Algorithms for Complex Networks. Benard Abola Mathematical Methods and Computational Algorithms for Complex Networks Benard Abola Division of Applied Mathematics, Mälardalen University Department of Mathematics, Makerere University Second Network

More information

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech CSE 6242/ CX 4242 Feb 18, 2014 Graphs / Networks Centrality measures, algorithms, interactive applications Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey

More information

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node A Modified Algorithm to Handle Dangling Pages using Hypothetical Node Shipra Srivastava Student Department of Computer Science & Engineering Thapar University, Patiala, 147001 (India) Rinkle Rani Aggrawal

More information









