Supervised Clustering of Yeast Gene Expression Data

In the DeRisi paper, five expression profile clusters were cited, each containing a small number (7-8) of genes. In the following examples, we apply supervised clustering techniques to these cluster prototypes to classify the remaining genes in the dataset. Classifiers were first trained on the genes in the original clusters and then applied to the remaining genes to assign each of them to a cluster.

In the first example, a Kohonen self-organizing feature map was used to arrange the original clusters in a two-dimensional layout. The unclassified genes were mapped using this layout, creating a clustering of the genes. New clusters were defined by selecting a region of the map for each new cluster, thus classifying the genes within that region.

In the second example, a decision tree was trained on the original clusters. An extra cluster was added to represent genes that do not sufficiently match the original DeRisi cluster expression profiles. The remaining genes were filtered to remove those without a significant change in expression level and were then classified by the decision tree.

In the third example, a Naive-Bayes classifier was generated from the original clusters.

[Figure: DeRisi Cluster Expression Profiles. Axes: fold change vs. time. Series: Centroid B (n=7), Centroid C (n=7), Centroid D (n=7), Centroid E (n=7), Centroid F (n=8).] The original DeRisi clusters are represented by a graph of the cluster centroids.

A parallel coordinates visualization displaying gene expression levels for each DeRisi cluster.

A Kohonen self-organizing feature map computes a new pair of axes and places the genes on them so that genes with similar expression profiles lie close together.
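As a rough illustration of how such a map can be trained and then used to place genes, here is a minimal sketch of a small rectangular self-organizing map. The grid size, learning-rate schedule, neighbourhood function, and the function names `train_som` and `best_matching_unit` are all assumptions made for illustration; the slides do not say which software or settings produced the maps shown.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate of the node whose weights are closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(profiles, rows=3, cols=3, epochs=40, seed=0):
    """Train a small SOM; returns {(row, col): weight vector}.

    Hypothetical settings: linear decay of learning rate and radius,
    Gaussian neighbourhood on the grid.
    """
    rng = random.Random(seed)
    # Initialise every map node with a copy of a randomly chosen profile.
    weights = {
        (r, c): list(rng.choice(profiles))
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(profiles)
    step = 0
    for _ in range(epochs):
        order = list(profiles)
        rng.shuffle(order)
        for x in order:
            remaining = 1.0 - step / total_steps
            rate = 0.5 * remaining                                # decays to 0
            radius = 0.5 + (max(rows, cols) / 2.0) * remaining    # shrinks
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                influence = math.exp(-grid_d2 / (2.0 * radius ** 2))
                # Pull this node toward x, more strongly near the BMU.
                for i, xi in enumerate(x):
                    w[i] += rate * influence * (xi - w[i])
            step += 1
    return weights
```

Once the map is trained on the prototype genes, an unclassified gene can be placed by calling `best_matching_unit(weights, profile)`; selecting a region of grid coordinates then amounts to assigning every gene mapped there to one cluster.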

A Kohonen self-organizing feature map displaying user-defined clusters.

[Figure: Kohonen Map Cluster Expression Profiles. Axes: fold change vs. time. Series: Centroid none (n=5730), Centroid newb (n=17), Centroid newf (n=210), Centroid newd (n=143), Centroid newc (n=26), Centroid newe (n=27).] The Kohonen self-organizing feature map clusters, presented as a plot of the cluster centroids.

A parallel coordinates visualization showing the new Kohonen map clusters as compared to the original DeRisi clusters.

A visualization of a decision tree that was created from the original DeRisi clusters (plus an extra None cluster). This part of the tree shows clusters E and F being split from cluster None at time 18.5, and clusters E and F being split apart at time 14.5.
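To make the idea of a split on a single time point concrete, the sketch below searches one attribute for the threshold that best separates the class labels. The use of Gini impurity and the names `gini` and `best_threshold` are assumptions; the slides do not state which tree-induction tool or splitting criterion was used.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    values: expression levels of all training genes at one time point.
    labels: the cluster label of each gene.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        threshold = (lo + hi) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

A full tree would apply this search to every time point, keep the attribute with the lowest score, and recurse on each side of the split.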

[Figure: Decision Tree Cluster Expression Profiles. Axes: fold change vs. time. Series: Centroid none (n=197), Centroid C (n=21), Centroid E (n=71), Centroid B (n=47), Centroid D (n=72), Centroid F (n=347).] The decision tree clusters, presented as a plot of the cluster centroids.

Visualization of the Naive-Bayes classifier created from the original DeRisi clusters. The attributes are listed in order of importance (with respect to the cluster designation). The fact that the squares for time 18.5 are mostly one color indicates that time 18.5 is a very good predictor for the cluster class.

This visualization of the Naive-Bayes classifier shows the probability distribution for cluster D. Cluster D can be classified perfectly from attribute T18.5 alone.
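The following sketch shows one way a Naive-Bayes classifier could be fitted to a handful of labelled expression profiles and applied to a new gene. It assumes a Gaussian likelihood per time point, which may differ from the (possibly discretized) model behind the visualizations above; `fit_gaussian_nb`, `predict`, and the variance floor `min_var` are hypothetical names and settings.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(profiles, labels, min_var=1e-3):
    """Fit per-class priors, means and variances for each time point."""
    by_class = defaultdict(list)
    for x, y in zip(profiles, labels):
        by_class[y].append(x)
    total = len(profiles)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Floor the variance so tiny training clusters do not yield zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, min_var)
            for col, m in zip(columns, means)
        ]
        model[y] = (math.log(n / total), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log posterior for profile x."""
    def log_posterior(y):
        log_prior, means, variances = model[y]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
            for xi, m, var in zip(x, means, variances)
        )
        return log_prior + log_likelihood

    return max(model, key=log_posterior)
```

When a single attribute separates one class from all others, as described for T18.5 and cluster D, that attribute's term dominates the sum in `log_posterior`.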

[Figure: five panels, one per cluster: G2/M (n=195), G1 (n=300), S (n=71), S/G2 (n=121), M/G1 (n=113). Axes: fold change vs. time points (T0 to T160 in steps of 10).] Expression levels of the five yeast cell cycle peak phases, as designated from the Spellman dataset. The average of each cluster is plotted for all time periods (T0-T160), along with the standard deviation values for each peak phase.

A visualization of the five Spellman peak phase clusters displayed as a sequence of sixteen histograms for each cell cycle.

A Radviz visualization of the yeast cell cycle data, clustered using the time and colored by Spellman's peak phase classification. This visualization technique employs the physical concept of spring forces to position the multi-dimensional data.
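The spring analogy can be sketched as follows: each dimension gets an anchor on the unit circle, each data value acts as the stiffness of a spring to its anchor, and the point rests where the forces balance, which is the weighted average of the anchor positions. The function name `radviz` and the min-max normalization of each column are assumptions chosen for illustration.

```python
import math


def radviz(rows):
    """Project multi-dimensional rows to 2-D Radviz coordinates."""
    dims = len(rows[0])
    # One anchor per dimension, evenly spaced on the unit circle.
    anchors = [
        (math.cos(2 * math.pi * j / dims), math.sin(2 * math.pi * j / dims))
        for j in range(dims)
    ]
    # Min-max normalise each column to [0, 1] so spring weights are non-negative.
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    points = []
    for row in rows:
        weights = [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        total = sum(weights)
        if total == 0:
            points.append((0.0, 0.0))  # no spring pulls: rest at the centre
            continue
        x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
        y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
        points.append((x, y))
    return points
```

A gene expressed at only one time point lands on that time point's anchor, while a gene with equal values everywhere lands at the centre of the circle.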

A dendrogram visualization displaying a user-selected cluster generated by a standard hierarchical clustering method.

A K-means clustering of the Spellman data. The visualization features a relative neighborhood graph (minimum set of lines that connect the centroids) and the outliers for all five K-means clusters.
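For readers unfamiliar with the relative neighborhood graph, here is a minimal sketch using its standard definition: two points are joined unless some third point is closer to both of them than they are to each other. The function name is hypothetical, and the slides do not say how the graph in the figure was computed.

```python
import math
from itertools import combinations


def relative_neighborhood_graph(points):
    """Return the edges (i, j) of the relative neighborhood graph.

    points: list of coordinate tuples, e.g. K-means centroids.
    Requires Python 3.8+ for math.dist.
    """
    def dist(a, b):
        return math.dist(points[a], points[b])

    edges = []
    for p, q in combinations(range(len(points)), 2):
        d_pq = dist(p, q)
        # The edge is blocked if a third point is closer to both ends.
        blocked = any(
            max(dist(p, r), dist(q, r)) < d_pq
            for r in range(len(points))
            if r not in (p, q)
        )
        if not blocked:
            edges.append((p, q))
    return edges
```

This brute-force version checks every triple of points, which is acceptable for a handful of centroids.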

A plot displaying the results of a Kohonen self-organizing map generated from the Spellman data, with the classification from Cho overlaid.

The statistical results of a self-organizing feature map trained on the Spellman data. Blue lines display the cluster centroids and the red lines show the standard deviations.
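The per-cluster statistics plotted here can be computed as in the sketch below, which takes the expression profiles of the genes in one cluster and returns the centroid and the standard deviation at each time point. The function name `cluster_summary` is hypothetical, and the choice of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import mean, pstdev


def cluster_summary(profiles):
    """Return (centroid, spread) for one cluster.

    profiles: list of equal-length expression profiles, one per gene.
    centroid: mean expression at each time point.
    spread:   population standard deviation at each time point.
    """
    columns = list(zip(*profiles))
    centroid = [mean(col) for col in columns]
    spread = [pstdev(col) for col in columns]
    return centroid, spread
```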

Comparison of a K-means clustering technique that generates 30 clusters with the five expression patterns designated by Spellman. While some of the 30 clusters represent subsets of a Spellman class (such as the yellow lines), other clusters have genes that fall into two or more Spellman classes.

A comparison of two clustering techniques using a jittered scatterplot of the Spellman data. Five clusters from one technique (along the Y-axis) are compared with 12 clusters from another technique (along the X-axis). If each X-axis cluster were a pure subset of a single Y-axis cluster, there would be only one clump per vertical line. In this case, only the 12th cluster on the X-axis is pure, while the 1st is nearly so.
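The same comparison can be made numerically with a cross-tabulation of the two label sets. In the sketch below, a cluster of the finer technique is called pure when all of its genes fall into a single cluster of the coarser technique, which corresponds to a single clump on its vertical line. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def cross_tabulate(fine_labels, coarse_labels):
    """Count, for each fine cluster, how its genes spread over coarse clusters."""
    table = defaultdict(Counter)
    for fine, coarse in zip(fine_labels, coarse_labels):
        table[fine][coarse] += 1
    return table


def pure_clusters(table):
    """Return the fine clusters whose genes all share one coarse cluster."""
    return sorted(fine for fine, counts in table.items() if len(counts) == 1)
```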

A circle segment visualization comparing the results of different classification techniques. The true class is represented in color, while the predicted class is represented in grayscale. If the change in grayscale value matches the change in color, then there is a strong correlation between the true and predicted class. In this example, "cl03" correlates well with the true class feature, the peak.

Comparing Clustering Techniques

[Table columns: Rank; Clustering Technique; Data; Number of Clusters; %correct method-1; %correct method-2; %correct method-3; %correct method-4; %correct maximum. Only the technique and data labels are recoverable; in table order, starting from rank 1: Kohonen 3 (Norm); Kohonen 1 (Norm); Kohonen 2 (Norm); C K-means 1 (Norm); SOM 4 (Original); SOM 12 (Original); Kohonen 2 (Original); C K-means 1 (Original); Kohonen 1 (Original); Kohonen 3 (Original); C K-means 2 (Norm); SOM 7 (Norm); M K-means 1 (Original); Dendrogram 2 (Original); K-means 2 (Original); SOM 7 (Original); Dendrogram 1 (Original); SOM 12 (Norm); M K-means 2 (Original); M K-means 3 (Original); random (Original).]

The results of several clustering techniques were compared to the five Spellman classifications (G2/M, G1, S, S/G2, and M/G1). For a given technique, each generated cluster was considered to be a subset of one of the Spellman classes. The class chosen for each cluster was the majority Spellman class among the genes in that cluster. After each cluster was categorized, the resulting accuracies were calculated. The total percent correct and the average accuracy for each class were calculated and are presented in the method columns.
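The majority-vote scoring described above can be sketched as follows. Each cluster is assigned the most common true class among its genes, and a gene counts as correct when its own class matches the class assigned to its cluster. The function name is hypothetical, and the slides do not specify how the four "method" columns differ, so this shows only the overall and per-class percentages.

```python
from collections import Counter, defaultdict


def majority_vote_accuracy(cluster_ids, true_classes):
    """Return (assigned, overall_percent, per_class_percent).

    cluster_ids:  cluster label of each gene from the technique under test.
    true_classes: reference class of each gene (e.g. Spellman peak phase).
    """
    members = defaultdict(list)
    for cid, cls in zip(cluster_ids, true_classes):
        members[cid].append(cls)
    # Assign each cluster its majority class (ties go to the first seen).
    assigned = {
        cid: Counter(classes).most_common(1)[0][0]
        for cid, classes in members.items()
    }
    totals = Counter(true_classes)
    correct = Counter()
    for cid, cls in zip(cluster_ids, true_classes):
        if assigned[cid] == cls:
            correct[cls] += 1
    overall = 100.0 * sum(correct.values()) / len(true_classes)
    per_class = {cls: 100.0 * correct[cls] / totals[cls] for cls in totals}
    return assigned, overall, per_class
```

Note that this score tends to rise with the number of clusters, since smaller clusters are more likely to be dominated by one class, so techniques producing different numbers of clusters are not compared on equal terms.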

Unsupervised Clustering of Yeast Gene Expression Data

In the Cho paper, 416 genes were visually identified as cell cycle regulated. In the Spellman paper, the Cho data was combined with the results from other experiments, and 800 genes were identified algorithmically as cell cycle regulated. In the following examples, we apply various unsupervised clustering techniques to a subset of the Cho dataset (the 800 genes identified as cell cycle regulated).

The first row (images 1-3) consists of visualizations of the original data (gene expression levels during two cell cycles). The second row (images 4-6) visually presents the results of several clustering algorithms. The third row (images 7-9) displays the statistical properties of each cluster generated by various algorithms. The fourth row (images 10-12) provides visual comparisons between selected clustering algorithms.

References

Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell Dec; 9(12):

Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, Davis RW. A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2: 65-73,

DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science Oct 24; 278(5338):


More information

Robust PDF Table Locator

Robust PDF Table Locator Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records

More information

A Frequent Itemset Nearest Neighbor Based Approach for Clustering Gene Expression Data

A Frequent Itemset Nearest Neighbor Based Approach for Clustering Gene Expression Data A Frequent Itemset Nearest Neighbor Based Approach for Clustering Gene Expression Data Rosy Das, D. K. Bhattacharyya and J. K. Kalita Department of Computer Science and Engineering Tezpur University, Tezpur,

More information

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification 1 Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification Feng Chu and Lipo Wang School of Electrical and Electronic Engineering Nanyang Technological niversity Singapore

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Package RobustRankAggreg

Package RobustRankAggreg Type Package Package RobustRankAggreg Title Methods for robust rank aggregation Version 1.1 Date 2010-11-14 Author Raivo Kolde, Sven Laur Maintainer February 19, 2015 Methods for aggregating ranked lists,

More information

Model-Based Clustering and Data Transformations for Gene Expression Data

Model-Based Clustering and Data Transformations for Gene Expression Data To appear, Bioinformatics and The Third Georgia Tech-Emory International Conference on Bioinformatics Model-Based Clustering and Data Transformations for Gene Expression Data Yeung, K. Y. y Fraley, C.

More information

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..

More information

Data mining techniques for actuaries: an overview

Data mining techniques for actuaries: an overview Data mining techniques for actuaries: an overview Emiliano A. Valdez joint work with Banghee So and Guojun Gan University of Connecticut Advances in Predictive Analytics (APA) Conference University of

More information

Correlation Motif Vignette

Correlation Motif Vignette Correlation Motif Vignette Hongkai Ji, Yingying Wei October 30, 2018 1 Introduction The standard algorithms for detecting differential genes from microarray data are mostly designed for analyzing a single

More information

Parallel Coordinates ++

Parallel Coordinates ++ Parallel Coordinates ++ CS 4460/7450 - Information Visualization Feb. 2, 2010 John Stasko Last Time Viewed a number of techniques for portraying low-dimensional data (about 3

More information

Biological Networks Analysis

Biological Networks Analysis Biological Networks Analysis Introduction and Dijkstra s algorithm Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein The clustering problem: partition genes into distinct

More information

ClaNC: The Manual (v1.1)

ClaNC: The Manual (v1.1) ClaNC: The Manual (v1.1) Alan R. Dabney June 23, 2008 Contents 1 Installation 3 1.1 The R programming language............................... 3 1.2 X11 with Mac OS X....................................

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Data mining with Support Vector Machine

Data mining with Support Vector Machine Data mining with Support Vector Machine Ms. Arti Patle IES, IPS Academy Indore (M.P.) artipatle@gmail.com Mr. Deepak Singh Chouhan IES, IPS Academy Indore (M.P.) deepak.schouhan@yahoo.com Abstract: Machine

More information

Grid-Layout Visualization Method in the Microarray Data Analysis Interactive Graphics Toolkit

Grid-Layout Visualization Method in the Microarray Data Analysis Interactive Graphics Toolkit Grid-Layout Visualization Method in the Microarray Data Analysis Interactive Graphics Toolkit Li Xiao, Oleg Shats, and Simon Sherman * Nebraska Informatics Center for the Life Sciences Eppley Institute

More information

Mining Gene Expression Data Using PCA Based Clustering

Mining Gene Expression Data Using PCA Based Clustering Vol. 5, No. 1, January-June 2012, pp. 13-18, Published by Serials Publications, ISSN: 0973-7413 Mining Gene Expression Data Using PCA Based Clustering N.P. Gopalan 1 and B. Sathiyabhama 2 * 1 Department

More information

Visual Data Mining. Overview. Apr. 24, 2007 Visual Analytics presentation Julia Nam

Visual Data Mining. Overview. Apr. 24, 2007 Visual Analytics presentation Julia Nam Overview Visual Data Mining Apr. 24, 2007 Visual Analytics presentation Julia Nam Visual Classification: An Interactive Approach to Decision Tree Construction M. Ankerst, C. Elsen, M. Ester, H. Kriegel,

More information

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008 CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008 Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute NAME: Prof. Ruiz Problem

More information

THE MANUAL.

THE MANUAL. THE MANUAL Jeffrey T. Leek, Eva Monsen, Alan R. Dabney, and John D. Storey Department of Biostatistics Department of Genome Sciences University of Washington http://faculty.washington.edu/jstorey/edge/

More information

1 Case study of SVM (Rob)

1 Case study of SVM (Rob) DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how

More information

APPLY DATA CLUSTERING TO GENE EXPRESSION DATA

APPLY DATA CLUSTERING TO GENE EXPRESSION DATA California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 12-2015 APPLY DATA CLUSTERING TO GENE EXPRESSION DATA Abdullah Jameel

More information

Adaptive quality-based clustering of gene expression profiles

Adaptive quality-based clustering of gene expression profiles Adaptive quality-based clustering of gene expression profiles Frank De Smet *, Janick Mathys, Kathleen Marchal, Gert Thijs, Bart De Moor, Yves Moreau ESAT-SISTA/COSIC/DocArch, K.U.Leuven, Kasteelpark Arenberg

More information

Seismic facies analysis using generative topographic mapping

Seismic facies analysis using generative topographic mapping Satinder Chopra + * and Kurt J. Marfurt + Arcis Seismic Solutions, Calgary; The University of Oklahoma, Norman Summary Seismic facies analysis is commonly carried out by classifying seismic waveforms based

More information

BMC Bioinformatics. Open Access. Abstract

BMC Bioinformatics. Open Access. Abstract BMC Bioinformatics BioMed Central Methodology article Methods for simultaneously identifying coherent local clusters with smooth global patterns in gene expression profiles Yin-Jing Tien 1, Yun-Shien Lee

More information

Figures and figure supplements

Figures and figure supplements RESEARCH ARTICLE Figures and figure supplements Comprehensive machine learning analysis of Hydra behavior reveals a stable basal behavioral repertoire Shuting Han et al Han et al. elife 8;7:e35. DOI: https://doi.org/.755/elife.35

More information

FPF-SB: a Scalable Algorithm for Microarray Gene Expression Data Clustering

FPF-SB: a Scalable Algorithm for Microarray Gene Expression Data Clustering FPF-SB: a Scalable Algorithm for Microarray Gene Expression Data Clustering Filippo Geraci 1,3, Mauro Leoncini 2,1, Manuela Montangero 2,1, Marco Pellegrini 1, and M. Elena Renda 1 1 CNR, Istituto di Informatica

More information

A Dendrogram. Bioinformatics (Lec 17)

A Dendrogram. Bioinformatics (Lec 17) A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and

More information

Time Series Gene Expression Data Classification via L 1 -norm Temporal SVM

Time Series Gene Expression Data Classification via L 1 -norm Temporal SVM Time Series Gene Expression Data Classification via L 1 -norm Temporal SVM Carlotta Orsenigo and Carlo Vercellis Dept. of Management, Economics and Industrial Engineering, Politecnico di Milano Via Lambruschini

More information

A Feature Selection Method to Handle Imbalanced Data in Text Classification

A Feature Selection Method to Handle Imbalanced Data in Text Classification A Feature Selection Method to Handle Imbalanced Data in Text Classification Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2 1 School of Information and Communication Engineering Beijing University

More information

Random Forest Similarity for Protein-Protein Interaction Prediction from Multiple Sources. Y. Qi, J. Klein-Seetharaman, and Z.

Random Forest Similarity for Protein-Protein Interaction Prediction from Multiple Sources. Y. Qi, J. Klein-Seetharaman, and Z. Random Forest Similarity for Protein-Protein Interaction Prediction from Multiple Sources Y. Qi, J. Klein-Seetharaman, and Z. Bar-Joseph Pacific Symposium on Biocomputing 10:531-542(2005) RANDOM FOREST

More information