Statistical Genomics: Gene Expression Profiling
1 Statistical Genomics: Gene Expression Profiling Jung-Hsin Lin School of Pharmacy National Taiwan University r.mc.ntu.edu.tw/~jlin/statisticalgenomics.pdf ext 8404
2 Microarray Experiment Samples Further reading: Pharmacogenomics 3 (2002). Biopsy (OCT embedding), microtome, HE staining or immunofluorescence staining, RNA QC, LCM, RNA extraction, RNA QC. Culture cells or whole blood cells (lysed in Soln. D), RNA extraction, RNA QC. Total RNA, T7-based RNA amplification (1-2 working days), aRNA. Cy3/Cy5 direct labeling (1 working day) or Cy3/Cy5 post-labeling (1-2 working days), Cy3-/Cy5-labeled target cDNA. Hybridization: 17 hours. Scanning: 0.5 working day. Data analysis: 0.5 working days. Procedure I: 2-3 weeks. Procedures II & III: 1-2 weeks.
3 Workflow of a cDNA Microarray EST fragments arrayed in 96- or 384-well plates are spotted at high density onto a glass microscope slide. Subsequently, two different fluorescently labeled cDNA populations derived from independent mRNA samples are hybridized to the array. After washing, a laser scans the slide and the ratio of induced fluorescence of the two samples is calculated for each EST, which indicates the relative amount of transcript for that EST in the two samples.
4 Dye-Swap Experiment Array A Array B
5 Microarray Data Analysis Feature Extraction MIAME Data Visualization Data Clustering Analysis of Variance Expression Profile Comparison Pathway Identification
6 Feature Extraction Algorithms (Agilent) 1. FindSpots and SpotAnalysis algorithms 2. PolyOutlierFlagger algorithm 3. BGSub (Background subtraction) algorithm 4. DyeNorm algorithm 5. Ratio algorithm
7 1. FindSpots and SpotAnalyzer FindSpots algorithm: 1) Locate the corners 2) Locate the spots SpotAnalyzer algorithm: 1) Define features 2) Estimate the radius for the local background 3) Reject outliers 4) Calculate the mean signal of the feature 5) Calculate the mean signal of the local background 6) Determine if the feature is saturated
8 Feature definition with CookieCutter method Feature definition with WholeSpot method
9 2. PolyOutlier algorithm 1) Determine if the feature is a non-uniformity outlier 2) Determine if the feature is a population outlier
10 3. BGSubtractor algorithm 1) Calculate the feature background-subtracted signal: unadjusted background-subtracted signals, adjusted background-subtracted signals 2) Calculate the significance of feature intensity relative to background 3) Determine if the feature background-subtracted signal is well above the background
11 Choices for Background Subtraction Methods Local background (Default) Average of all background areas Average of negative control features Minimum signal (feature or background) Minimum signal (feature) on array
12 4. Dye Normalization algorithm 1) Determine normalization features a) Using all significant, non-control, and non-outlier features method b) Using list of normalization genes method c) Using Rank Consistency Filter method (Default) 2) Calculate the normalization factor 3) Calculate the dye normalized signal
13 Scatter Plot of Normalized Cy5/Cy3 Signals
14 5. Ratio algorithm 1) Calculate the surrogate value 2) Calculate the processed signal 3) Calculate the log ratio of the feature 4) Calculate the p-value and error on the log ratio of the feature
15 Log Ratio vs. Feature ID
16 Histogram of Log Ratio Values
17 ArrayMap Upload the array data and the image. Set up criteria for mapping genes on the array image. Display the array image and circle the features found on the image.
18 More ArrayMap Functions [3D surface plot of intensity vs. X and Y position for an uploaded data file]
19 Searching for Repeated Genes Upload files for across-array gene comparison. Find repeated genes and save the search results in a file.
20 MIAME MIAME (Minimum Information About a Microarray Experiment) is a standard that describes the minimal information needed to fully annotate a gene expression experiment; being MIAME-compliant has become a prerequisite for publishing microarray data. 1. Experimental design 2. Array design 3. Samples 4. Hybridization 5. Measurements 6. Normalization and controls
21 3D Data Visualization To provide an in-depth 3D scatter plot tool and interactive representations of highly complex data. Expression data values or analysis results can be placed on any of the 3 user-defined axes to create a powerful medium for array data presentation.
22 Tree View of Gene Classification To facilitate powerful, easy navigation within dendrograms. To display gene classifications and experiment parameters on gene and condition trees, respectively. To simplify examination of clinical and experimental data in the context of clustered expression patterns.
23 Visual Filtering To provide a simplified and visually intuitive user interface for filtering tools. Real-time generation of graphs of results.
24 Analysis of Variance (ANOVA) To reliably identify differentially expressed genes. To identify genes capable of discriminating between one or more experimental parameters or sample phenotypes. Groups of genes identified by expression profiling can be further characterized by performing sequence searches for potential regulatory elements.
25 Expression Profile Comparison To explore all of the experiments related to a single genome in the databases. To identify target expression patterns and quickly find similar expression profiles within the database of all normalized sample sets. To characterize the results of compound screening experiments and patient treatment studies.
26 Pathway Viewer To visually characterize genes and their expression patterns based on their location within a cellular pathway. Users can design their own pathway diagrams or directly import publicly available pathway maps. Users can predict genes associated with discrete steps in the pathway of interest.
27 Affymetrix Data Analysis Flow Chart Chip Description Files Hybridization + Scanning EXP File Image analysis DAT File + CDF File CEL File Processing 1. Background Correction 2. Normalization 3. PM Correction 4. Expression Index GCOS MAS CHP File Intensity value Absent/Present call RMA Text File Probe ID + Log2(Intensity) RPT File Report File, quality Excel File
28 Affymetrix Data Files: DAT Files. A *.DAT file is ~110 MBytes.
29 From DAT to CEL
30 Affymetrix Data Files: CDF Files
31 MAS 5.0 Analysis output file (*.CHP)
32 RNA Quality Assessment RNA Degradation Plot
33 Statistical Plots: Box Plot
34 Statistical Plots
35 Scatter Plot and MA plot
36 MAS 5.0 Expression Report File (.RPT)
37 MAS 5.0 Expression Report File (.RPT)
38 Options for Normalization Levels: PM & MM (MAS 5.1), PM-MM (dChip), PM only (RMA). Features: all, rank-invariant set, spike-in, housekeeping. Methods: Complete data (no reference chip, information from all arrays used): Quantile Normalization, MVA plot + Loess. Baseline (normalized using a reference chip): MAS 5.0, Li-Wong's model-based, Q-spline.
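Quantile normalization, one of the complete-data methods listed above, forces every array to share the same empirical distribution (the mean of the per-array sorted values). A minimal numpy sketch, with an illustrative toy matrix (function and variable names are not from the slides; ties are broken by order of appearance, a simplification):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns (arrays) of X.
    Rows are probes, columns are arrays. Each value is replaced by the
    mean, over arrays, of the values holding the same rank."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each value within its column
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)   # shared reference distribution
    return mean_quantiles[ranks]

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
Xn = quantile_normalize(X)
# every column of Xn now contains the same set of values {2, 3, 14/3, 17/3}
```

After normalization, a quantile-quantile plot of any two arrays is exactly the diagonal, which is why this method needs no reference chip.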
39 Summarization Methods: MAS 5.0 1. Cell intensities are preprocessed for global background. 2. An ideal mismatch value is calculated and subtracted to adjust the PM intensity. 3. The adjusted PM intensities are log-transformed to stabilize the variance. 4. The biweight estimator is used to provide a robust mean of the resulting values; Signal is output as the antilog of the resulting value. 5. Finally, Signal is scaled using a trimmed mean. The probe value PV for every probe pair j in probe set i is PV_ij = log2(V_ij), with V_ij = max(PM_ij - IM_ij, δ), j = 1, ..., n_i, where n_i is the number of probe pairs in the probe set. Then SignalLogValue_i = T_bi(PV_i,1, ..., PV_i,n_i), where T_bi is the one-step Tukey biweight estimate. The scale factor is sf = target signal / TrimMean(2^SignalLogValue, 0.02, 0.98), the normalization factor is nf = TrimMean(2^SPV_baseline, 0.02, 0.98) / TrimMean(2^SPV_experiment, 0.02, 0.98), and ReportedValue(i) = nf · sf · 2^(SignalLogValue_i).
40 Summarization Methods: RMA Median Polish This is the summarization used in the RMA expression summary. A multichip linear model is fit to the data from each probe set; the median polish is an algorithm for fitting this model robustly. It should be noted that expression values calculated with this summary measure are on the log2 scale. For a probe set with probes i = 1, ..., I and data from arrays j = 1, ..., J, fit the model log2(PM_ij) = α_i + β_j + ε_ij, where α_i is a probe effect and β_j is the log2 expression value on array j.
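The median-polish fit itself is simple to sketch. Below is a minimal Tukey median polish of the additive model above, assuming a probes-by-arrays matrix of log2(PM) values (function and variable names are illustrative, not the Bioconductor implementation):

```python
import numpy as np

def median_polish(Y, n_iter=10):
    """Robustly fit Y[i, j] ~ row[i] + col[j] + resid[i, j] by
    alternately sweeping out row and column medians."""
    resid = np.asarray(Y, dtype=float).copy()
    row = np.zeros(resid.shape[0])   # probe effects (alpha_i, plus the grand effect)
    col = np.zeros(resid.shape[1])   # per-array effects (the expression estimates beta_j)
    for _ in range(n_iter):
        rmed = np.median(resid, axis=1)
        row += rmed
        resid -= rmed[:, None]
        cmed = np.median(resid, axis=0)
        col += cmed
        resid -= cmed[None, :]
    return row, col, resid

# Exactly additive toy data: 3 probes x 2 arrays
a = np.array([0.0, 1.0, 2.0])
b = np.array([5.0, 7.0])
Y = a[:, None] + b[None, :]
row, col, resid = median_polish(Y)
# on additive data the residuals vanish and Y == row[:,None] + col[None,:] + resid
```

The decomposition identity Y = row + col + resid is maintained by every sweep, so the fit is exact on additive data and robust (median-based) otherwise.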
41 Software Shareware/Freeware: Bioconductor (R. Gentleman); DNA-Chip Analyzer (dChip v1.3) (Li and Wong); RMAExpress: a simple standalone GUI program for Windows for computing the RMA expression measure. Commercial: Affymetrix Microarray Suite (MAS 5.1); Affymetrix GeneChip Operating Software (GCOS v1.2)
42 Gene Ontology GO-Annotator The GO collaborators are developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.
43 GO-Annotated GeneChip Result
44 Reference
45 Random Variables Probability density function (pdf): f(x) = P(X = x). Cumulative distribution function (cdf): F(x) = P(X ≤ x).
46 The Normal (Gaussian) Distribution f(x) = (1/(σ√(2π))) e^(-(x-μ)²/(2σ²)) The distribution is symmetric around the mean. Approximately 68% of the values lie within 2 standard deviations of the mean (μ ± 1σ), approximately 95% within 4 standard deviations (μ ± 2σ), and approximately 99.7% within 6 standard deviations (μ ± 3σ). The inflexion points of the curve occur at μ ± σ.
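The density and the coverage fractions above can be checked numerically with the error function; a small sketch (function names are illustrative):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density f(x) = (1/(sigma*sqrt(2*pi))) * exp(-(x-mu)^2/(2*sigma^2))."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Gaussian cdf via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Fraction of values within k standard deviations on each side of the mean:
coverage = {k: normal_cdf(k) - normal_cdf(-k) for k in (1, 2, 3)}
# coverage ≈ {1: 0.6827, 2: 0.9545, 3: 0.9973}
```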
47 Mean Population mean: μ = (1/N) Σ_{i=1..N} X_i, where N is the size of the population. Sample mean: X̄ = (1/n) Σ_{i=1..n} X_i, where n is the size of the sample, a subset of the population.
48 Mode, Median & Percentile Mode: the value that occurs most often in a data set. Median: the value situated in the middle of the ordered list of data. p-th percentile: the value that has p% of members below it and (100-p)% of members above it.
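These three location statistics are easy to compute with numpy and the standard library; a quick example on a toy data set (the data values are illustrative):

```python
import numpy as np
from collections import Counter

data = np.array([2, 3, 3, 5, 7, 8, 8, 8, 10])

mode = Counter(data.tolist()).most_common(1)[0][0]  # value occurring most often -> 8
median = np.median(data)                            # middle of the ordered list -> 7.0
p25 = np.percentile(data, 25)                       # 25% of members lie below -> 3.0
```

Note that numpy's default percentile uses linear interpolation between order statistics, so for data without a value exactly at the requested rank the result may fall between two observations.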
49 Range and Variance Range: X_max - X_min. Population variance: σ² = Σ_{i=1..N} (X_i - μ)² / N. Sample variance: s² = Σ_{i=1..n} (X_i - X̄)² / (n-1).
50 Z Value Z = (X - μ) / σ Any normal distribution can be mapped to the standard normal distribution, which has a mean of zero and a standard deviation of one: μ_Z = 0, σ_Z = 1.
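The standardization can be verified empirically: after the Z transform, any sample has mean 0 and standard deviation 1. A short sketch (the location 10 and scale 2 are arbitrary illustrative values):

```python
import numpy as np

def z_score(x, mu, sigma):
    """Map values from N(mu, sigma) onto the standard normal N(0, 1)."""
    return (x - mu) / sigma

rng = np.random.default_rng(0)
samples = rng.normal(loc=10.0, scale=2.0, size=100_000)
z = z_score(samples, samples.mean(), samples.std())
# z now has mean 0 and standard deviation 1 (exactly, since we standardize
# by the sample's own mean and standard deviation)
```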
51 p-Value p = P(Z > x) = 1 - P(Z ≤ x) The p-value provides information about the amount of trust we can place in a decision made using a given threshold: it is the probability of making an error if we choose x as the threshold.
52 Student's t-test for significantly different means t measures the significance that two distributions have different means: t = (X̄_A - X̄_B) / s_D, where s_D = sqrt( [ (Σ_{i∈A}(x_i - X̄_A)² + Σ_{j∈B}(x_j - X̄_B)²) / (N_A + N_B - 2) ] · (1/N_A + 1/N_B) ). The significance is evaluated from the incomplete beta function, A(t|ν) = 1 - I_{ν/(ν+t²)}(ν/2, 1/2), with ν = N_A + N_B - 2 degrees of freedom, where I_x(a,b) = B_x(a,b)/B(a,b) and B_x(a,b) = ∫_0^x s^(a-1) (1-s)^(b-1) ds.
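The pooled-variance t statistic above can be sketched directly in numpy; the function name and toy samples are illustrative:

```python
import numpy as np

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic:
    t = (mean_A - mean_B) / s_D, with nu = N_A + N_B - 2 degrees of freedom."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_ss = ((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()
    s_d = np.sqrt(pooled_ss / (na + nb - 2) * (1.0 / na + 1.0 / nb))
    return (a.mean() - b.mean()) / s_d

t = two_sample_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
# means differ by -1, s_D = sqrt(2/3), so t = -1.2247...
```

In a microarray setting this statistic is computed per gene across replicate arrays, with the incomplete beta function (or a t-distribution table) converting t and ν into a significance.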
53 F-test for significantly different variances F = var(A)/var(B) is tested at the significance level at which the hypothesis that distribution 1 has smaller variance than distribution 2 can be rejected. A small numerical value implies a very significant rejection, in turn implying high confidence in the hypothesis that 1 has variance greater than or equal to 2. The significance is Q(F|ν1, ν2) = I_{ν2/(ν2+ν1·F)}(ν2/2, ν1/2), with ν1 = N_A - 1 and ν2 = N_B - 1, where I_x(a,b) is the incomplete beta function.
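The F statistic itself is just the ratio of the two sample variances; a minimal sketch with illustrative toy samples:

```python
import numpy as np

def f_statistic(a, b):
    """F = var(A) / var(B) with sample variances (ddof=1);
    degrees of freedom are nu1 = N_A - 1 and nu2 = N_B - 1."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a.var(ddof=1) / b.var(ddof=1), len(a) - 1, len(b) - 1

F, nu1, nu2 = f_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 2.5, 3.0])
# var(A) = 5/3, var(B) = 1/4, so F = 20/3 with nu1 = 3, nu2 = 2
```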
54 Corrected Sum of Squares and Standard Deviation Corrected sum of squares (CSS): Σ_{i=1..n} (X_i - X̄)². Population standard deviation: σ = sqrt( Σ_{i=1..N} (X_i - μ)² / N ). Sample standard deviation: s = sqrt( Σ_{i=1..n} (X_i - X̄)² / (n-1) ).
55 Covariance and Correlation Covariance: Cov(X,Y) = Σ_{i=1..n} (X_i - X̄)(Y_i - Ȳ) / (n-1). (Pearson) correlation coefficient: ρ_XY = Cov(X,Y) / (s_X s_Y) = Σ_i (X_i - X̄)(Y_i - Ȳ) / sqrt( Σ_i (X_i - X̄)² · Σ_i (Y_i - Ȳ)² ).
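The two definitions are one line apart in code; a minimal sketch (the toy vectors are illustrative):

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation: Cov(X, Y) / (s_X * s_Y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
    return cov / (x.std(ddof=1) * y.std(ddof=1))

r = pearson_corr([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear -> 1.0
```

Pearson correlation across arrays is also the usual similarity measure between two gene expression profiles (e.g. in Eisen's hierarchical clustering below).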
56 Covariance Matrix and Correlation Matrix Covariance matrix: Σ = [σ_ij], with rows (σ_11, σ_12, σ_13, ...), (σ_21, σ_22, σ_23, ...), ..., (σ_k1, σ_k2, σ_k3, ..., σ_kk). (Pearson) correlation matrix: ρ_ij = σ_ij / (σ_i σ_j).
57 Principal Component Analysis Covariance matrix: C_ij = ⟨q_i(t) q_j(t)⟩_t. Diagonalization: C T = T Λ, with Λ = diag(λ_1, λ_2, λ_3, ..., λ_f), λ_1 ≥ λ_2 ≥ ... ≥ λ_f, and T = [v_1, v_2, v_3, ..., v_f], so that C v_i = λ_i v_i.
58 Why Principal Component Analysis? There are usually too many degrees of freedom in a complex system. The importance of variables is evaluated statistically. Dimensionality (or the number of degrees of freedom) can be significantly reduced by looking only at the most important new coordinates. The first principal component is the normalized linear combination with maximum variance. The principal components are the characteristic vectors (eigenvectors) of the covariance matrix.
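The eigenvector characterization above translates directly into code: diagonalize the covariance matrix and sort by eigenvalue. A minimal sketch on synthetic data whose variance is concentrated along one known direction (all names and the toy direction (3, 1, 0) are illustrative):

```python
import numpy as np

def pca(X):
    """Principal components = eigenvectors of the covariance matrix,
    sorted by decreasing eigenvalue (variance along each component)."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# 500 points in 3-D, almost all variance along the direction (3, 1, 0)
rng = np.random.default_rng(0)
t = rng.normal(size=(500, 1))
X = t @ np.array([[3.0, 1.0, 0.0]]) + 0.05 * rng.normal(size=(500, 3))
lam, V = pca(X)
# lam[0] dominates, and V[:, 0] points (up to sign) along (3, 1, 0)/sqrt(10)
```

Keeping only the first few columns of V is exactly the dimensionality reduction the slide describes.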
59 Clustering Methods: Agglomerative Method (Bottom-Up): proceeds by a series of fusions of the n objects into groups. More commonly used. Divisive Method (Top-Down): proceeds by a series of partitions into finer groups.
60 Anything can be clustered Real Fake
61 Dendrogram A two-dimensional diagram which illustrates the fusions or divisions made at each successive stage of the clustering process.
62 Agglomerative methods According to different ways of defining distances (similarities), we have some popular agglomerative methods as follows: Single linkage clustering Complete linkage clustering Average linkage clustering Average group linkage Ward's hierarchical clustering method
63 Single linkage clustering Also known as the nearest neighbor technique. Defining feature: the distance between groups, D(r,s), is defined as the distance between the closest pair of objects, where only pairs consisting of one object from each group are considered. D(r, s) = min{ d(i, j) : i ∈ r, j ∈ s } The distance between every possible object pair (i, j) is computed, where object i is in cluster r and object j is in cluster s. The minimum value of these distances is said to be the distance between clusters r and s. In other words, the distance between two clusters is given by the value of the shortest link between the clusters. At each stage of hierarchical clustering, the clusters r and s for which D(r, s) is minimum are merged.
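One agglomeration step of single linkage can be sketched directly from the definition: compute D(r, s) as the shortest link between every pair of clusters and merge the closest pair. The function and toy points below are illustrative (a brute-force sketch, not an efficient implementation):

```python
import numpy as np

def single_linkage_step(points, clusters):
    """Merge the two clusters whose closest pair of points is nearest:
    D(r, s) = min{ d(i, j) : i in r, j in s }."""
    best = (np.inf, None, None)
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = min(np.linalg.norm(points[i] - points[j])
                    for i in clusters[a] for j in clusters[b])
            if d < best[0]:
                best = (d, a, b)
    d, a, b = best
    merged = clusters[a] + clusters[b]
    rest = [c for k, c in enumerate(clusters) if k not in (a, b)]
    return rest + [merged], d

points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
clusters = [[0], [1], [2]]
clusters, d = single_linkage_step(points, clusters)
# the nearest pair is points 0 and 1 (distance 1.0), so they merge first
```

Repeating the step n-1 times yields the full dendrogram; recording d at each merge gives the heights of the fusions.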
64 Complete linkage clustering Also known as the farthest neighbor technique, the opposite of single linkage. Defining feature: the distance between groups, D(r,s), is defined as the distance between the most distant pair of objects, where only pairs consisting of one object from each group are considered. D(r, s) = max{ d(i, j) : i ∈ r, j ∈ s } The distance between every possible object pair (i, j) is computed, where object i is in cluster r and object j is in cluster s. The maximum value of these distances is said to be the distance between clusters r and s. In other words, the distance between two clusters is given by the value of the longest link between the clusters. At each stage of hierarchical clustering, the clusters r and s for which D(r, s) is minimum are merged.
65 Average linkage clustering Defining feature: the distance between groups, D(r,s), is defined as the average distance between all pairs of objects, where each pair is made up of one object from each group. D(r, s) = T_rs / (N_r N_s), where T_rs is the sum of all pairwise distances between cluster r and cluster s, and N_r and N_s are the sizes of clusters r and s respectively. At each stage of hierarchical clustering, the clusters r and s for which D(r, s) is minimum are merged.
66 Average group linkage Defining feature: groups once formed are represented by the mean values of each variable, i.e., their mean vector, and the intergroup distance, D(r,s), is defined in terms of the distance between two such mean vectors. The two clusters r and s are merged such that, after the merger, the average pairwise distance within the newly formed cluster is minimum. Suppose we label the new cluster formed by merging clusters r and s as t. Then D(r,s), the distance between clusters r and s, is computed as D(r, s) = average{ d(i, j) : i, j ∈ t, t = r ∪ s } At each stage of hierarchical clustering, the clusters r and s for which D(r, s) is minimum are merged. In this case, the two clusters are merged such that the newly formed cluster, on average, has minimum pairwise distances between the points in it.
67 k-means clustering The k-means clustering algorithm partitions (or clusters) N data points into K disjoint subsets S_j, each containing N_j data points, so as to minimize the sum-of-squares criterion H = Σ_{j=1..K} Σ_{n∈S_j} |x_n - μ_j|², where x_n is a vector representing the n-th data point and μ_j is the geometric centroid of the data points in S_j. Algorithm: 1. The data points are assigned at random to the K sets. 2. The centroid is computed for each set, and H is evaluated. Repeat these two steps until H has reached its minimum.
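The alternating assignment/centroid iteration can be sketched in a few lines of numpy; the function name, toy points, and the fixed iteration count are illustrative:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimize H = sum_j sum_{n in S_j} ||x_n - mu_j||^2 by alternating
    assignment of each point to its nearest centroid and recomputation
    of each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # fancy indexing copies
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated pairs of points -> two obvious clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centroids = kmeans(X, 2)
```

Because H only decreases at each step, the iteration converges; in practice one restarts from several random assignments and keeps the partition with the smallest H.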
68 Profile Plots associated with k-means clustering
69 Self-organizing map (SOM) clustering The Self-Organizing Map (SOM) was introduced by Teuvo Kohonen in 1982. The SOM (also known as the Kohonen feature map) algorithm is one of the best known artificial neural network algorithms. In contrast to many other neural networks using supervised learning, the SOM is based on unsupervised learning. Self-adaptive topological maps were initially inspired by modelling perception systems found in the mammalian brain. A perception system involves the reception of external signals and their processing inside the nervous system. Complex mammalian skills, such as seeing and hearing, seemed to bear similarity to each other in the way they worked: the primary characteristic of these systems is that neighbouring neurons encode input signals which are similar to each other. The SOM is quite a unique kind of neural network in the sense that it constructs a topology-preserving mapping from the high-dimensional space onto map units in such a way that relative distances between data points are preserved. The map units, or neurons, usually form a two-dimensional regular lattice where the location of a map unit carries semantic information. The SOM can thus serve as a clustering tool for high-dimensional data. Because of its typical two-dimensional shape, it is also easy to visualize. Another important feature of the SOM is its capability to generalize: in other words, it can interpolate between previously encountered inputs.
70 Outline of the SOM algorithm The SOM defines a mapping from the high-dimensional input data space onto a regular two-dimensional array of neurons. Every neuron i of the map is associated with an n-dimensional reference vector m_i = [m_i1, ..., m_in]^T, where n denotes the dimension of the input vectors. The reference vectors together form a codebook. The neurons of the map are connected to adjacent neurons by a neighbourhood relation, which dictates the topology, or the structure, of the map. The most common topologies in use are rectangular and hexagonal. Adjacent neurons belong to the neighbourhood N_i of the neuron i. In the basic SOM algorithm, the topology and the number of neurons remain fixed from the beginning. The number of neurons determines the granularity of the mapping, which has an effect on the accuracy and generalization of the SOM. During the training phase, the SOM forms an elastic net that folds onto the "cloud" formed by the input data. The algorithm controls the net so that it strives to approximate the density of the data. The reference vectors in the codebook drift to the areas where the density of the input data is high. Eventually, only a few codebook vectors lie in areas where the input data is sparse.
71 The SOM learning process 1. One sample vector x is randomly drawn from the input data set and its similarity (distance) to the codebook vectors is computed by using e.g. the common Euclidean distance measure: ||x - m_c|| = min_i { ||x - m_i|| }. 2. After the Best Matching Unit (BMU) c has been found, the codebook vectors are updated. The BMU itself as well as its topological neighbours are moved closer to the input vector in the input space, i.e. the input vector attracts them. The magnitude of the attraction is governed by the learning rate. As the learning proceeds and new input vectors are given to the map, the learning rate gradually decreases to zero according to the specified learning rate function type. Along with the learning rate, the neighbourhood radius decreases as well. The update rule for the reference vector of unit i is: m_i(t+1) = m_i(t) + α(t)[x(t) - m_i(t)] if i ∈ N_c(t), and m_i(t+1) = m_i(t) otherwise. 3. Steps 1 and 2 together constitute a single training step, and they are repeated until the training ends. The number of training steps must be fixed prior to training the SOM because the rate of convergence in the neighbourhood function and the learning rate is calculated accordingly.
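A single training step (BMU search plus the neighbourhood update rule above) can be sketched as follows, assuming a hard neighbourhood of fixed radius on the map grid; the 2x2 map, its values, and the function name are illustrative toy choices:

```python
import numpy as np

def som_update(codebook, x, alpha, radius, grid):
    """One SOM training step: find the best matching unit (BMU) c,
    then move the BMU and its grid neighbours N_c toward the input x:
    m_i <- m_i + alpha * (x - m_i) for i in N_c."""
    bmu = np.argmin(np.linalg.norm(codebook - x, axis=1))
    for i in range(len(codebook)):
        if np.linalg.norm(grid[i] - grid[bmu]) <= radius:  # i is in the neighbourhood N_c
            codebook[i] += alpha * (x - codebook[i])
    return bmu

# A 2x2 map of 2-D reference vectors; grid holds each unit's map coordinates
grid = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
codebook = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
bmu = som_update(codebook, x=np.array([0.2, 0.1]), alpha=0.5, radius=1.0, grid=grid)
# unit 0 is the BMU; units 0, 1, 2 (grid distance <= 1) move toward x, unit 3 stays put
```

In full training, alpha and radius both decay over the training steps, which is what lets the elastic net first unfold globally and then fine-tune locally.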
72 Dimensionality reduction by SOM In this example, the SOM projects an input space with four dimensions onto a feature map with only 2 dimensions.
73 Gene clustering using SOM Using a SOM, two genes that are plotted next to each other are necessarily similar according to the chosen distance metric. 1D feature map.
74 Eisen's Hierarchical Clustering Method The similarity score between genes i and j is a form of correlation: c_ij = (1/N) Σ_t [ (E_i(t) - Ē_i)/σ_i ] · [ (E_j(t) - Ē_j)/σ_j ], where Ē_i = (1/N) Σ_t E_i(t) and σ_i² = (1/N) Σ_t (E_i(t) - Ē_i)². 1. Construct a matrix of similarity measures between all pairs of genes. 2. Recursively cluster the genes into a tree-like hierarchy: a) Find the pair of clusters with the highest correlation and combine the pair into a single cluster. b) Update the correlation matrix using the average values of the newly combined cluster. c) Repeat steps (a) and (b) N-1 times until all genes have been clustered. 3. Determine the boundaries between individual genes. Eisen et al., Proc. Natl. Acad. Sci. USA 95 (1998)
75 Protein-Protein Interaction Network (Biocarta)
76 Pathway Viewer To visually characterize genes and their expression patterns based on their location within a cellular pathway. Users can design their own pathway diagrams or directly import publicly available pathway maps. Users can predict genes associated with discrete steps in the pathway of interest.
77
78
79
80
81
82
83
84 Modeling Metabolic Insulin Signaling Pathways Am J Physiol Endocrinol Metab 283: E1084-E1101, 2002
85 Definition of Variables x1: Insulin input. x2: Concentration of unbound surface insulin receptors. x3: Concentration of unphosphorylated once-bound surface receptors. x4: Concentration of phosphorylated twice-bound surface receptors. x5: Concentration of phosphorylated once-bound surface receptors. x6: Concentration of unbound unphosphorylated intracellular receptors. x7: Concentration of phosphorylated twice-bound intracellular receptors. x8: Concentration of phosphorylated once-bound intracellular receptors. x9: Concentration of unphosphorylated IRS-1. x10: Concentration of tyrosine-phosphorylated IRS-1. x11: Concentration of unactivated PI 3-kinase. x12: Concentration of tyrosine-phosphorylated IRS-1/activated PI 3-kinase complex. x13: Percentage of PI(3,4,5)P3 out of the total lipid population. x14: Percentage of PI(4,5)P2 out of the total lipid population. x15: Percentage of PI(3,4)P2 out of the total lipid population. x16: Percentage of unactivated Akt. x17: Percentage of activated Akt. x18: Percentage of unactivated PKC-ζ. x19: Percentage of activated PKC-ζ. x20: Percentage of intracellular GLUT4. x21: Percentage of cell-surface GLUT4.
86 Mathematical Formulation [System of coupled ordinary differential equations dx_i/dt for the 21 state variables above, describing insulin-receptor binding, phosphorylation, and internalization kinetics regulated by [PTP], lipid turnover regulated by [PTEN] and [SHIP], and downstream activation of Akt, PKC-ζ, and GLUT4 translocation; see the cited paper for the full equations.]
More informationAnalyzing ICAT Data. Analyzing ICAT Data
Analyzing ICAT Data Gary Van Domselaar University of Alberta Analyzing ICAT Data ICAT: Isotope Coded Affinity Tag Introduced in 1999 by Ruedi Aebersold as a method for quantitative analysis of complex
More informationMicroarray data analysis
Microarray data analysis Computational Biology IST Technical University of Lisbon Ana Teresa Freitas 016/017 Microarrays Rows represent genes Columns represent samples Many problems may be solved using
More informationMicro-array Image Analysis using Clustering Methods
Micro-array Image Analysis using Clustering Methods Mrs Rekha A Kulkarni PICT PUNE kulkarni_rekha@hotmail.com Abstract Micro-array imaging is an emerging technology and several experimental procedures
More informationMixture models and clustering
1 Lecture topics: Miture models and clustering, k-means Distance and clustering Miture models and clustering We have so far used miture models as fleible ays of constructing probability models for prediction
More informationA novel firing rule for training Kohonen selforganising
A novel firing rule for training Kohonen selforganising maps D. T. Pham & A. B. Chan Manufacturing Engineering Centre, School of Engineering, University of Wales Cardiff, P.O. Box 688, Queen's Buildings,
More informationUnsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray
More informationDS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li
Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand
More informationData Mining and Analytics. Introduction
Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data
More informationCluster Analysis: Agglomerate Hierarchical Clustering
Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical
More informationAdvanced visualization techniques for Self-Organizing Maps with graph-based methods
Advanced visualization techniques for Self-Organizing Maps with graph-based methods Georg Pölzlbauer 1, Andreas Rauber 1, and Michael Dittenbach 2 1 Department of Software Technology Vienna University
More informationWorkload Characterization Techniques
Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/
More informationUnderstanding Clustering Supervising the unsupervised
Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data
More informationClassification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University
Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate
More informationPreprocessing -- examples in microarrays
Preprocessing -- examples in microarrays I: cdna arrays Image processing Addressing (gridding) Segmentation (classify a pixel as foreground or background) Intensity extraction (summary statistic) Normalization
More informationBased on Raymond J. Mooney s slides
Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit
More informationLecture Topic Projects
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, basic tasks, data types 3 Introduction to D3, basic vis techniques for non-spatial data Project #1 out 4 Data
More informationHow and what do we see? Segmentation and Grouping. Fundamental Problems. Polyhedral objects. Reducing the combinatorics of pose estimation
Segmentation and Grouping Fundamental Problems ' Focus of attention, or grouping ' What subsets of piels do we consider as possible objects? ' All connected subsets? ' Representation ' How do we model
More informationClustering. Supervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationSegmentation Computer Vision Spring 2018, Lecture 27
Segmentation http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 218, Lecture 27 Course announcements Homework 7 is due on Sunday 6 th. - Any questions about homework 7? - How many of you have
More informationUnsupervised Learning
Unsupervised Learning Learning without Class Labels (or correct outputs) Density Estimation Learn P(X) given training data for X Clustering Partition data into clusters Dimensionality Reduction Discover
More informationContents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results
Statistical Analysis of Microarray Data Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation of clustering results Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be
More informationHierarchical Clustering 4/5/17
Hierarchical Clustering 4/5/17 Hypothesis Space Continuous inputs Output is a binary tree with data points as leaves. Useful for explaining the training data. Not useful for making new predictions. Direction
More informationHard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering
An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other
More informationIncorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data
Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data Ryan Atallah, John Ryan, David Aeschlimann December 14, 2013 Abstract In this project, we study the problem of classifying
More informationClustering Jacques van Helden
Statistical Analysis of Microarray Data Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation
More informationClustering analysis of gene expression data
Clustering analysis of gene expression data Chapter 11 in Jonathan Pevsner, Bioinformatics and Functional Genomics, 3 rd edition (Chapter 9 in 2 nd edition) Human T cell expression data The matrix contains
More informationFunction approximation using RBF network. 10 basis functions and 25 data points.
1 Function approximation using RBF network F (x j ) = m 1 w i ϕ( x j t i ) i=1 j = 1... N, m 1 = 10, N = 25 10 basis functions and 25 data points. Basis function centers are plotted with circles and data
More information3. Multidimensional Information Visualization II Concepts for visualizing univariate to hypervariate data
3. Multidimensional Information Visualization II Concepts for visualizing univariate to hypervariate data Vorlesung Informationsvisualisierung Prof. Dr. Andreas Butz, WS 2009/10 Konzept und Basis für n:
More informationAcquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.
Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting
More informationEECS 730 Introduction to Bioinformatics Microarray. Luke Huan Electrical Engineering and Computer Science
EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/ GeneChip 2011/11/29 EECS 730 2 Hybridization to the Chip 2011/11/29
More informationECG782: Multidimensional Digital Signal Processing
ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting
More informationComparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays.
Comparisons and validation of statistical clustering techniques for microarray gene expression data Susmita Datta and Somnath Datta Presented by: Jenni Dietrich Assisted by: Jeffrey Kidd and Kristin Wheeler
More informationKey properties of local features
Key properties of local features Locality, robust against occlusions Must be highly distinctive, a good feature should allow for correct object identification with low probability of mismatch Easy to etract
More informationMetabolomic Data Analysis with MetaboAnalyst
Metabolomic Data Analysis with MetaboAnalyst User ID: guest6522519400069885256 April 14, 2009 1 Data Processing and Normalization 1.1 Reading and Processing the Raw Data MetaboAnalyst accepts a variety
More informationGene expression & Clustering (Chapter 10)
Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching
More informationWorking with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan
Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using
More informationMultivariate Analysis
Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data
More informationA Dendrogram. Bioinformatics (Lec 17)
A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and
More informationData Warehousing and Machine Learning
Data Warehousing and Machine Learning Preprocessing Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 DWML Spring 2008 1 / 35 Preprocessing Before you can start on the actual
More informationMICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS
Mathematical and Computational Applications, Vol. 5, No. 2, pp. 240-247, 200. Association for Scientific Research MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Volkan Uslan and Đhsan Ömür Bucak
More informationLesson 3. Prof. Enza Messina
Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical
More informationPreprocessing DWML, /33
Preprocessing DWML, 2007 1/33 Preprocessing Before you can start on the actual data mining, the data may require some preprocessing: Attributes may be redundant. Values may be missing. The data contains
More informationUsing Machine Learning to Optimize Storage Systems
Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation
More informationCluster Analysis. Ying Shen, SSE, Tongji University
Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group
More informationGene Expression an Overview of Problems & Solutions: 1&2. Utah State University Bioinformatics: Problems and Solutions Summer 2006
Gene Expression an Overview of Problems & Solutions: 1&2 Utah State University Bioinformatics: Problems and Solutions Summer 2006 Review DNA mrna Proteins action! mrna transcript abundance ~ expression
More informationIntroduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering
Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical
More informationData Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha
Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationECS 234: Data Analysis: Clustering ECS 234
: Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More informationCluster Analysis. Angela Montanari and Laura Anderlucci
Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a
More informationEE795: Computer Vision and Intelligent Systems
EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 WRI C225 Lecture 04 130131 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Histogram Equalization Image Filtering Linear
More informationDATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm
DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)
More informationDiscussion: Clustering Random Curves Under Spatial Dependence
Discussion: Clustering Random Curves Under Spatial Dependence Gareth M. James, Wenguang Sun and Xinghao Qiao Abstract We discuss the advantages and disadvantages of a functional approach to clustering
More informationClustering algorithms and autoencoders for anomaly detection
Clustering algorithms and autoencoders for anomaly detection Alessia Saggio Lunch Seminars and Journal Clubs Université catholique de Louvain, Belgium 3rd March 2017 a Outline Introduction Clustering algorithms
More informationOlmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.
Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)
More informationSelf-Organizing Maps for cyclic and unbounded graphs
Self-Organizing Maps for cyclic and unbounded graphs M. Hagenbuchner 1, A. Sperduti 2, A.C. Tsoi 3 1- University of Wollongong, Wollongong, Australia. 2- University of Padova, Padova, Italy. 3- Hong Kong
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and
More informationThe Traveling Salesman
Neural Network Approach To Solving The Traveling Salesman Problem The Traveling Salesman The shortest route for a salesman to visit every city, without stopping at the same city twice. 1 Random Methods
More informationSupervised vs unsupervised clustering
Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful
More informationELEC Dr Reji Mathew Electrical Engineering UNSW
ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion
More information