Statistical Genomics: Gene Expression Profiling


1 Statistical Genomics: Gene Expression Profiling Jung-Hsin Lin School of Pharmacy National Taiwan University r.mc.ntu.edu.tw/~jlin/statisticalgenomics.pdf ext. 8404

2 Microarray Experiment Samples. Further reading: Pharmacogenomics 3 (2002). Biopsy (OCT embedding) → microtome → HE staining or immunofluorescence staining → RNA QC → LCM → RNA extraction → RNA QC. Cultured cells or whole blood cells (lysed in Soln. D) → RNA extraction → RNA QC. Total RNA → T7-based RNA amplification (1-2 working days) → aRNA. Cy3/Cy5 direct labeling (1 working day) or Cy3/Cy5 post-labeling (1-2 working days) → Cy3-/Cy5-labeled target cDNA. Hybridization: 17 hours. Scanning: 0.5 working day. Data analysis: 0.5 working day. Procedure I: 2-3 weeks; Procedures II & III: 1-2 weeks.

3 Workflow of a cDNA Microarray. EST fragments arrayed in 96- or 384-well plates are spotted at high density onto a glass microscope slide. Two fluorescently labeled cDNA populations derived from independent mRNA samples are then hybridized to the array. After washing, a laser scans the slide, and the ratio of induced fluorescence of the two samples is calculated for each EST, which indicates the relative amount of that transcript in the two samples.

4 Dye-Swap Experiment Array A Array B

5 Microarray Data Analysis Feature Extraction MIAME Data Visualization Data Clustering Analysis of Variance Expression Profile Comparison Pathway Identification

6 Feature Extraction Algorithms (Agilent) 1. FindSpots and SpotAnalysis algorithms 2. PolyOutlierFlagger algorithm 3. BGSub (Background subtraction) algorithm 4. DyeNorm algorithm 5. Ratio algorithm

7 1. FindSpots and SpotAnalyzer. FindSpots algorithm: 1) Locate the corners 2) Locate the spots. SpotAnalyzer algorithm: 1) Define features 2) Estimate the radius for the local background 3) Reject outliers 4) Calculate the mean signal of the feature 5) Calculate the mean signal of the local background 6) Determine if the feature is saturated

8 Feature definition with the CookieCutter method. Feature definition with the WholeSpot method.

9 2. PolyOutlierFlagger algorithm 1) Determine if the feature is a non-uniformity outlier 2) Determine if the feature is a population outlier

10 3. BGSubtractor algorithm 1) Calculate the feature background-subtracted signal (unadjusted and adjusted background-subtracted signals) 2) Calculate the significance of feature intensity relative to background 3) Determine if the feature background-subtracted signal is well above the background

11 Choices for Background Subtraction Methods: Local background (default); Average of all background areas; Average of negative control features; Minimum signal (feature or background); Minimum signal (feature) on array

12 4. Dye Normalization algorithm 1) Determine normalization features: a) using all significant, non-control, non-outlier features; b) using a list of normalization genes; c) using the Rank Consistency Filter method (default) 2) Calculate the normalization factor 3) Calculate the dye-normalized signal

13 Scatter Plot of Normalized Cy5/Cy3 Signals

14 5. Ratio algorithm 1) Calculate the surrogate value 2) Calculate the processed signal 3) Calculate the log ratio of the feature 4) Calculate the p-value and error on the log ratio of the feature

15 Log Ratio vs. Feature ID

16 Histogram of Log Ratio Values

17 ArrayMap Upload the array data and the image. Set up criteria for mapping genes on the array image. Display the array image and circle the found feature on the image.

18 More ArrayMap Functions [3D surface plot of feature intensity versus X and Y position]

19 Searching Repeated Genes Upload files for cross-array gene comparison. Find repeated genes and save the search results to a file.

20 MIAME (Minimum Information About a Microarray Experiment) is a standard describing the minimal information needed to fully annotate a gene expression experiment; MIAME compliance has become a prerequisite for publishing microarray data. MIAME covers: 1. Experimental design 2. Array design 3. Samples 4. Hybridization 5. Measurements 6. Normalization and controls

21 3D Data Visualization To provide an in-depth 3D scatter-plot tool and interactive representations of highly complex data. Expression data values or analysis results can be placed on any of the three user-defined axes to create a powerful medium for array data presentation.

22 Tree View of Gene Classification To facilitate powerful, easy navigation within dendrograms. To display gene classifications and experiment parameters on gene and condition trees, respectively. To simplify examination of clinical and experimental data in the context of clustered expression patterns.

23 Visual Filtering To provide a simplified and visually intuitive user interface for filtering tools, with real-time generation of result graphs.

24 Analysis of Variance (ANOVA) To reliably identify differentially expressed genes. To identify genes capable of discriminating between one or more experimental parameters or sample phenotypes. Groups of genes identified by expression profiling can be further characterized by performing sequence searches for potential regulatory elements.

25 Expression Profile Comparison To explore all of the experiments related to a single genome in the databases. To identify target expression patterns and quickly find similar expression profiles within all the normalized sample-set databases. To characterize the results of compound-screening experiments and patient treatment studies.

26 Pathway Viewer To visually characterize genes and their expression patterns based on their location within a cellular pathway. Users can design their own pathway diagrams or directly import publicly available pathway maps. Users can predict genes associated with discrete steps in the pathway of interest.

27 Affymetrix Data Analysis Flow Chart. Hybridization + scanning → EXP file. Image analysis → DAT file; DAT file + CDF file (Chip Description File) → CEL file. Processing: 1. Background correction 2. Normalization 3. PM correction 4. Expression index. GCOS/MAS → CHP file (intensity value, Absent/Present call). RMA → text file (probe ID + log2 intensity). RPT file: report file (quality). Excel file.

28 Affymetrix Data Files: DAT Files. A *.DAT file is ~110 MBytes.

29 From DAT to CEL

30 Affymetrix Data Files: CDF Files

31 MAS 5.0 Analysis output file (*.CHP)

32 RNA Quality Assessment RNA Degradation Plot

33 Statistical Plots: Box Plot

34 Statistical Plots

35 Scatter Plot and MA plot

36 MAS 5.0 Expression Report File (.RPT)

37 MAS 5.0 Expression Report File (.RPT)

38 Options for Normalization. Levels: PM & MM (MAS 5.1), PM−MM (dChip), PM only (RMA). Features: all, rank-invariant set, spike-ins, housekeeping genes. Methods: Complete data (no reference chip; information from all arrays used): quantile normalization, MVA plot + loess. Baseline (normalized against a reference chip): MAS 5.0, Li-Wong's model-based method, Q-spline.
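Quantile normalization, one of the complete-data methods above, is simple enough to show in full. The sketch below (our own helper names, not any vendor's implementation) forces every array to share the same empirical distribution: each value is replaced by the mean, across arrays, of the values holding the same rank.

```python
def quantile_normalize(arrays):
    """Quantile normalization of equal-length arrays: replace each value
    with the mean of the same-ranked values across all arrays."""
    n = len(arrays[0])
    # per-array index order, smallest value first
    order = [sorted(range(n), key=a.__getitem__) for a in arrays]
    # reference distribution: mean of the k-th smallest values across arrays
    ref = [sum(a[o[k]] for a, o in zip(arrays, order)) / len(arrays)
           for k in range(n)]
    out = [[0.0] * n for _ in arrays]
    for a_idx, o in enumerate(order):
        for rank, idx in enumerate(o):
            out[a_idx][idx] = ref[rank]   # put reference value back by rank
    return out
```

After normalization every array contains exactly the same set of values, only permuted according to each array's original ranks.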

39 Summarization Methods: MAS 5.0
1. Cell intensities are preprocessed for global background.
2. An ideal mismatch value is calculated and subtracted to adjust the PM intensity.
3. The adjusted PM intensities are log-transformed to stabilize the variance.
4. The Tukey biweight estimator is used to provide a robust mean of the resulting values; Signal is output as the antilog of the resulting value.
5. Finally, Signal is scaled using a trimmed mean.
The probe value PV for every probe pair j in probe set i is calculated as
V_ij = max(PM_ij − IM_ij, δ), with δ = 2^−20 by default,
PV_ij = log2(V_ij), j = 1, ..., n_i, where n_i is the number of probe pairs in the probe set.
SignalLogValue_i = Tbi(PV_i,1, ..., PV_i,n_i), where Tbi is the one-step Tukey biweight estimate.
The scaling factor is sf = target signal / TrimMean(2^SignalLogValue, 0.02, 0.98), and
ReportedValue_i = sf × 2^(SignalLogValue_i).
(For comparison analyses a normalization factor nf is computed analogously as a ratio of baseline and experiment trimmed means.)

40 Summarization Methods: RMA Median Polish. This is the summarization used in the RMA expression summary. A multichip linear model is fit to the data from each probe set; median polish is an algorithm for fitting this model robustly. Note that the expression values calculated with this summary measure are on the log2 scale. For a probe set with probes i = 1, ..., I and arrays j = 1, ..., J, fit the model
log2(PM_ij) = α_i + β_j + ε_ij,
where α_i is the probe affinity effect and β_j is the log2 expression value on array j.
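Tukey's median polish can be sketched in a few lines of pure Python (an illustration of the idea, not the RMA reference implementation; the helper name and toy data are ours). Row and column medians are swept out alternately until the residuals stabilize; the per-array expression values are the overall level plus the array (column) effects.

```python
import statistics

def median_polish(x, n_iter=10):
    """Robustly fit x[i][j] = overall + row[i] + col[j] by alternately
    sweeping out row and column medians; returns per-column fitted levels."""
    rows, cols = len(x), len(x[0])
    resid = [row[:] for row in x]
    overall, row_eff, col_eff = 0.0, [0.0] * rows, [0.0] * cols
    for _ in range(n_iter):
        for i in range(rows):                    # sweep row medians
            m = statistics.median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = statistics.median(row_eff)           # fold row level into overall
        overall += m
        row_eff = [v - m for v in row_eff]
        for j in range(cols):                    # sweep column medians
            m = statistics.median(resid[i][j] for i in range(rows))
            col_eff[j] += m
            for i in range(rows):
                resid[i][j] -= m
        m = statistics.median(col_eff)           # fold column level into overall
        overall += m
        col_eff = [v - m for v in col_eff]
    # chip-level expression values: overall level + array (column) effect
    return [overall + c for c in col_eff]
```

On a noiseless 3-probe × 2-array matrix built from known probe and array effects, the fit recovers the per-array levels exactly.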

41 Software. Shareware/Freeware: Bioconductor (R. Gentleman); DNA-Chip Analyzer (dChip v1.3) (Li and Wong); RMAExpress: a simple standalone GUI program for Windows for computing the RMA expression measure. Commercial: Affymetrix Microarray Suite (MAS 5.1); Affymetrix GeneChip Operating Software (GCOS v1.2).

42 Gene Ontology GO-Annotator The GO collaborators are developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.

43 GO-Annotated GeneChip Result

44 Reference

45 Random Variables. Probability density function (pdf): f(x) = P(X = x). Cumulative distribution function (cdf): F(x) = P(X ≤ x).

46 The Normal (Gaussian) Distribution
f(x) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²))
The distribution is symmetric around the mean. Approximately 68% of the values lie within 1 standard deviation of the mean, approximately 95% within 2 standard deviations, and approximately 99.7% within 3 standard deviations. The inflection points of the curve occur at μ ± σ.

47 Mean. Population mean: μ = (1/N) Σ_{i=1}^{N} X_i, where N is the size of the whole population. Sample mean: X̄ = (1/n) Σ_{i=1}^{n} X_i, where n is the size of the sample, a subset of the population.

48 Mode, Median & Percentile. Mode: the value that occurs most often in a data set. Median: the value situated in the middle of the ordered list of data. p-th percentile: the value that has p% of members below it and (100 − p)% of members above it.

49 Range and Variance. Range: X_max − X_min. Population variance: σ² = Σ_{i=1}^{N} (X_i − μ)² / N. Sample variance: s² = Σ_{i=1}^{n} (X_i − X̄)² / (n − 1), for a sample of size n.

50 Z Value. Z = (X − μ) / σ. Any normal distribution can be mapped to the standard normal distribution, which has mean zero and standard deviation one (μ_Z = 0, σ_Z = 1).
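The mapping is a one-liner (the scale below, mean 100 and standard deviation 15, is just an illustrative choice, and the function name is ours):

```python
def z_score(x, mu, sigma):
    """Map x from a normal distribution with mean mu and standard
    deviation sigma onto the standard normal N(0, 1)."""
    return (x - mu) / sigma

# IQ-style scale: mean 100, standard deviation 15 (illustrative numbers)
zs = [z_score(x, 100.0, 15.0) for x in (85.0, 100.0, 130.0)]
```

A value one standard deviation below the mean maps to z = −1, the mean itself to z = 0, and two standard deviations above to z = 2.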

51 p-Value. P(Z > x) = 1 − P(Z ≤ x). The p-value provides information about the amount of trust we can place in a decision made using a given threshold: it is the probability of making an error if we choose x as the threshold.
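The upper-tail probability can be evaluated directly from the standard normal cdf, Φ(z) = (1 + erf(z/√2)) / 2, using only the Python standard library (the function name is ours):

```python
import math

def upper_tail_p(z):
    """P(Z > z) for a standard normal Z, via the error function:
    Phi(z) = (1 + erf(z / sqrt(2))) / 2, so P(Z > z) = 1 - Phi(z)."""
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

For example, the familiar threshold z = 1.96 gives an upper-tail probability of about 0.025.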

52 Student's t-test for significantly different means. The t-test gives the significance of the difference between the means of two distributions A and B:
s_D = sqrt( [ (Σ_{i∈A} (x_i − x̄_A)² + Σ_{j∈B} (x_j − x̄_B)²) / (N_A + N_B − 2) ] · (1/N_A + 1/N_B) ),
t = (x̄_A − x̄_B) / s_D, with ν = N_A + N_B − 2 degrees of freedom.
The significance is given by the incomplete beta function, p = I_{ν/(ν+t²)}(ν/2, 1/2), where
I_x(a, b) = (1/B(a, b)) ∫_0^x s^(a−1) (1−s)^(b−1) ds.
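The pooled statistic above translates directly into code (a sketch with hypothetical sample data; the final p-value step, which needs the incomplete beta function, is omitted here):

```python
import math

def pooled_t(a, b):
    """Student's t statistic and degrees of freedom for two samples,
    assuming equal variances (pooled standard error s_D)."""
    na, nb = len(a), len(b)
    xa, xb = sum(a) / na, sum(b) / nb
    ssa = sum((x - xa) ** 2 for x in a)   # corrected sum of squares, group A
    ssb = sum((x - xb) ** 2 for x in b)   # corrected sum of squares, group B
    nu = na + nb - 2
    s_d = math.sqrt((ssa + ssb) / nu * (1.0 / na + 1.0 / nb))
    return (xa - xb) / s_d, nu
```

For the toy samples A = {1, 2, 3} and B = {2, 3, 4}, s_D = √(2/3) and t = −√1.5 ≈ −1.225 with ν = 4.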

53 F-test for significantly different variances. F = var(A) / var(B), with the larger variance in the numerator. The significance is the level at which the hypothesis that A has smaller variance than B can be rejected; a small numerical value implies a very significant rejection, in turn implying high confidence in the hypothesis that A has variance greater than or equal to B's. With ν₁ = N_A − 1 and ν₂ = N_B − 1, the significance is
Q(F | ν₁, ν₂) = I_{ν₂/(ν₂+ν₁F)}(ν₂/2, ν₁/2),
where I_x(a, b) = (1/B(a, b)) ∫_0^x s^(a−1) (1−s)^(b−1) ds is the incomplete beta function.

54 Corrected Sum of Squares and Standard Deviation. Corrected sum of squares (CSS): Σ_{i=1}^{n} (X_i − X̄)². Population standard deviation: σ = sqrt( Σ_{i=1}^{N} (X_i − μ)² / N ). Sample standard deviation: s = sqrt( Σ_{i=1}^{n} (X_i − X̄)² / (n − 1) ).

55 Covariance and Correlation. Covariance: Cov(X, Y) = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / (n − 1). (Pearson) correlation coefficient: ρ_XY = Cov(X, Y) / (s_X s_Y) = Σ_i (X_i − X̄)(Y_i − Ȳ) / sqrt( Σ_i (X_i − X̄)² · Σ_i (Y_i − Ȳ)² ).
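The correlation coefficient is easy to compute from the definition (a minimal sketch; the function name is ours):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-deviation
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

A perfectly linear increasing relationship gives r = 1; a perfectly linear decreasing one gives r = −1.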

56 Covariance Matrix and Correlation Matrix. Covariance matrix: the symmetric matrix Σ whose (i, j) entry is σ_ij = Cov(X_i, X_j):
Σ = [ σ₁₁ σ₁₂ σ₁₃ … ; σ₂₁ σ₂₂ σ₂₃ … ; … ].
(Pearson) correlation matrix: ρ_ij = σ_ij / (σ_i σ_j).

57 Principal Component Analysis. Form the covariance matrix C of the variables, C_ij = ⟨(q_i(t) − ⟨q_i⟩)(q_j(t) − ⟨q_j⟩)⟩, and diagonalize it:
C T = T Λ, with Λ = diag(λ₁, λ₂, λ₃, …, λ_f) and T = [v₁, v₂, v₃, …, v_f],
so that each column satisfies C v_i = λ_i v_i.

58 Why Principal Component Analysis? There are usually too many degrees of freedom in a complex system. The importance of variables is evaluated statistically. Dimensionality (the number of degrees of freedom) can be significantly reduced by looking only at the most important new coordinates. The first principal component is the normalized linear combination with maximum variance. The principal components are the characteristic vectors (eigenvectors) of the covariance matrix.
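For two variables the eigenproblem C v = λ v has a closed form, which makes the idea easy to demonstrate without a linear-algebra library (a sketch; the function name and toy data are ours):

```python
import math

def first_pc_2d(points):
    """Largest eigenvalue and its unit eigenvector for the 2x2 sample
    covariance matrix of 2-D points: the first principal component."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # larger eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = 0.5 * (sxx + syy) + 0.5 * math.sqrt((sxx - syy) ** 2 + 4 * sxy * sxy)
    if sxy == 0.0:                        # axis-aligned covariance
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    else:
        vx, vy = sxy, lam - sxx           # unnormalized eigenvector
        norm = math.hypot(vx, vy)
        vx, vy = vx / norm, vy / norm
    return lam, (vx, vy)
```

For points lying exactly on the line y = x, the first principal component is the diagonal direction (1, 1)/√2 with variance λ = 2.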

59 Clustering Methods: Agglomerative method (bottom-up): proceeds by a series of fusions of the n objects into groups; more commonly used. Divisive method (top-down): proceeds by a series of partitions into finer groups.

60 Anything can be clustered: Real vs. Fake

61 Dendrogram. A two-dimensional diagram which illustrates the fusions or divisions made at each successive stage of the clustering process.

62 Agglomerative methods. According to different ways of defining distances (similarities), we have some popular agglomerative methods as follows: single linkage clustering, complete linkage clustering, average linkage clustering, average group linkage, and Ward's hierarchical clustering method.

63 Single linkage clustering. Also known as the nearest-neighbour technique. Defining feature: the distance between groups, D(r,s), is the distance between the closest pair of objects, where only pairs consisting of one object from each group are considered:
D(r, s) = min{ d(i, j) : i ∈ r, j ∈ s }
The distance between every possible object pair (i, j) is computed, where object i is in cluster r and object j is in cluster s; the minimum of these distances is the distance between clusters r and s. In other words, the distance between two clusters is given by the value of the shortest link between them. At each stage of hierarchical clustering, the clusters r and s for which D(r,s) is minimum are merged.
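The merge loop can be sketched naively in a few lines (an O(n³) illustration, not an efficient implementation; the helper name is ours):

```python
def single_linkage(points, dist):
    """Agglomerative clustering with single-linkage distance: repeatedly
    merge the two clusters whose closest pair of members is nearest."""
    clusters = [[p] for p in points]          # start with singletons
    merges = []                               # (distance, cluster, cluster)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: closest pair, one member from each cluster
                d = min(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, tuple(clusters[a]), tuple(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```

On the 1-D points {0, 1, 10}, the first merge joins 0 and 1 at distance 1; the second joins that pair with 10 at distance 9 (the shortest link, |1 − 10|).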

64 Complete linkage clustering. Also known as the farthest-neighbour technique, the opposite of single linkage. Defining feature: the distance between groups, D(r,s), is the distance between the most distant pair of objects, where only pairs consisting of one object from each group are considered:
D(r, s) = max{ d(i, j) : i ∈ r, j ∈ s }
The distance between every possible object pair (i, j) is computed, where object i is in cluster r and object j is in cluster s; the maximum of these distances is the distance between clusters r and s. In other words, the distance between two clusters is given by the value of the longest link between them. At each stage of hierarchical clustering, the clusters r and s for which D(r,s) is minimum are merged.

65 Average linkage clustering. Defining feature: the distance between groups, D(r,s), is the average distance between all pairs of objects, where each pair is made up of one object from each group:
D(r, s) = T_rs / (N_r × N_s)
where T_rs is the sum of all pairwise distances between cluster r and cluster s, and N_r and N_s are the sizes of clusters r and s, respectively. At each stage of hierarchical clustering, the clusters r and s for which D(r,s) is minimum are merged.

66 Average group linkage. Defining feature: once formed, groups are represented by the mean values of each variable, i.e., their mean vector, and the intergroup distance D(r,s) is defined in terms of the distance between two such mean vectors. The two clusters r and s are merged such that, after the merger, the average pairwise distance within the newly formed cluster is minimum. Labelling the new cluster formed by merging r and s as t,
D(r, s) = average{ d(i, j) : i, j ∈ t, t = r ∪ s }
At each stage of hierarchical clustering, the clusters r and s for which D(r,s) is minimum are merged: the two clusters are merged such that the newly formed cluster, on average, has minimum pairwise distances between the points in it.

67 k-means clustering. The k-means algorithm partitions (clusters) N data points into K disjoint subsets S_j, containing N_j data points each, so as to minimize the sum-of-squares criterion
H = Σ_{j=1}^{K} Σ_{n ∈ S_j} ‖x_n − μ_j‖²
where x_n is a vector representing the n-th data point and μ_j is the geometric centroid of the data points in S_j. Algorithm: 1. The data points are assigned at random to the K sets. 2. The centroid is computed for each set, each point is reassigned to the set with the nearest centroid, and H is evaluated. These two steps are repeated until H reaches its minimum.
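The two alternating steps can be sketched as a minimal Lloyd iteration (the toy data and names are ours; in a microarray setting the points would be gene expression vectors):

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm for the k-means criterion on tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)         # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # update step: recompute each centroid as its cluster mean
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centroids, clusters
```

On two well-separated 1-D groups, {0, 1} and {10, 11}, the iteration settles on centroids 0.5 and 10.5 regardless of which points are drawn as the initial centroids.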

68 Profile Plots associated with k-means clustering

69 Self-organizing map (SOM) clustering. The Self-Organizing Map (SOM) was introduced by Teuvo Kohonen in 1982. The SOM (also known as the Kohonen feature map) algorithm is one of the best-known artificial neural network algorithms. In contrast to many other neural networks using supervised learning, the SOM is based on unsupervised learning. Self-adaptive topological maps were initially inspired by modelling the perception systems found in the mammalian brain. A perception system involves the reception of external signals and their processing inside the nervous system. The complex mammalian skills, such as seeing and hearing, seemed to bear similarity to each other in the way they worked: the primary characteristic of these systems is that neighbouring neurons encode input signals which are similar to each other. The SOM is quite a unique kind of neural network in the sense that it constructs a topology-preserving mapping from the high-dimensional input space onto map units in such a way that relative distances between data points are preserved. The map units, or neurons, usually form a two-dimensional regular lattice where the location of a map unit carries semantic information. The SOM can thus serve as a clustering tool for high-dimensional data. Because of its typical two-dimensional shape, it is also easy to visualize. Another important feature of the SOM is its capability to generalize: it can interpolate between previously encountered inputs.

70 Outline of the SOM algorithm. The SOM defines a mapping from the high-dimensional input data space onto a regular two-dimensional array of neurons. Every neuron i of the map is associated with an n-dimensional reference vector m_i = [m_i1, …, m_in]^T, where n denotes the dimension of the input vectors. The reference vectors together form a codebook. The neurons of the map are connected to adjacent neurons by a neighbourhood relation, which dictates the topology, or structure, of the map. The most common topologies in use are rectangular and hexagonal. Adjacent neurons belong to the neighbourhood N_i of the neuron i. In the basic SOM algorithm, the topology and the number of neurons remain fixed from the beginning. The number of neurons determines the granularity of the mapping, which affects the accuracy and generalization capability of the SOM. During the training phase, the SOM forms an elastic net that folds onto the "cloud" formed by the input data. The algorithm controls the net so that it strives to approximate the density of the data: the reference vectors in the codebook drift to the areas where the density of the input data is high, and eventually only a few codebook vectors lie in areas where the input data is sparse.

71 The SOM learning process. 1. One sample vector x is randomly drawn from the input data set and its similarity (distance) to the codebook vectors is computed using, e.g., the common Euclidean distance measure:
‖x − m_c‖ = min_i { ‖x − m_i‖ }
2. After the best-matching unit (BMU) c has been found, the codebook vectors are updated. The BMU itself as well as its topological neighbours are moved closer to the input vector in the input space, i.e., the input vector attracts them. The magnitude of the attraction is governed by the learning rate. As the learning proceeds and new input vectors are given to the map, the learning rate gradually decreases to zero according to the specified learning-rate function; along with the learning rate, the neighbourhood radius decreases as well. The update rule for the reference vector of unit i is:
m_i(t+1) = m_i(t) + α(t) [x(t) − m_i(t)] if i ∈ N_c(t); m_i(t+1) = m_i(t) if i ∉ N_c(t)
3. Steps 1 and 2 together constitute a single training step, and they are repeated until the training ends. The number of training steps must be fixed prior to training the SOM, because the rate of convergence of the neighbourhood function and the learning rate is calculated accordingly.
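The three steps above can be sketched for scalar inputs and a 1-D lattice (the linear learning-rate and radius schedules below are arbitrary illustrative choices, and all names are ours):

```python
import random

def train_som_1d(data, n_units=4, n_steps=2000, seed=1):
    """Train a 1-D SOM on scalar data: find the best-matching unit (BMU),
    then pull it and its lattice neighbours toward the input, with the
    learning rate and neighbourhood radius decaying over time."""
    rng = random.Random(seed)
    weights = [rng.uniform(min(data), max(data)) for _ in range(n_units)]
    for t in range(n_steps):
        x = rng.choice(data)                       # step 1: draw a sample
        frac = t / n_steps
        alpha = 0.5 * (1.0 - frac)                 # learning rate -> 0
        radius = max(0, int(round((n_units / 2) * (1.0 - frac))))
        bmu = min(range(n_units), key=lambda i: abs(x - weights[i]))
        # step 2: update the BMU and its lattice neighbours
        for i in range(max(0, bmu - radius), min(n_units, bmu + radius + 1)):
            weights[i] += alpha * (x - weights[i])
    return weights
```

Trained on data concentrated at 0 and 10, the four units spread out so that some reference values end up near each mode.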

72 Dimensionality reduction by SOM. In this example, the SOM projects a four-dimensional input space onto a feature map with only two dimensions.

73 Gene clustering using SOM. Using a SOM, two genes that are plotted next to each other are necessarily similar according to the chosen distance metric. A 1D feature map is shown.

74 Eisen's Hierarchical Clustering Method. The similarity score between genes i and j is a Pearson-like correlation over the N observations:
c_ij = (1/N) Σ_{t=1}^{N} [ (E_i(t) − Ē_i) / σ_i ] · [ (E_j(t) − Ē_j) / σ_j ],
where Ē_i = (1/N) Σ_t E_i(t) and σ_i is the corresponding standard deviation.
1. Construct a matrix of similarity measures between all pairs of genes. 2. Recursively cluster the genes into a tree-like hierarchy: a) find the pair of clusters with the highest correlation and combine the pair into a single cluster; b) update the correlation matrix using the average values of the newly combined cluster; c) repeat steps (a) and (b) N − 1 times until all genes have been clustered. 3. Determine the boundaries between individual clusters. Eisen et al., Proc. Natl. Acad. Sci. USA 95 (1998)

75 Protein-Protein Interaction Network (Biocarta)

76 Pathway Viewer To visually characterize genes and their expression patterns based on their location within a cellular pathway. Users can design their own pathway diagrams or directly import publicly available pathway maps. Users can predict genes associated with discrete steps in the pathway of interest.

77-83 [Pathway diagram images]

84 Modeling Metabolic Insulin Signaling Pathways. Am J Physiol Endocrinol Metab 283: E1084-E1101, 2002

85 Definition of Variables
x1: insulin input
x2: concentration of unbound surface insulin receptors
x3: concentration of unphosphorylated once-bound surface receptors
x4: concentration of phosphorylated twice-bound surface receptors
x5: concentration of phosphorylated once-bound surface receptors
x6: concentration of unbound unphosphorylated intracellular receptors
x7: concentration of phosphorylated twice-bound intracellular receptors
x8: concentration of phosphorylated once-bound intracellular receptors
x9: concentration of unphosphorylated IRS-1
x10: concentration of tyrosine-phosphorylated IRS-1
x11: concentration of unactivated PI 3-kinase
x12: concentration of tyrosine-phosphorylated IRS-1/activated PI 3-kinase complex
x13: percentage of PI(3,4,5)P3 out of the total lipid population
x14: percentage of PI(4,5)P2 out of the total lipid population
x15: percentage of PI(3,4)P2 out of the total lipid population
x16: percentage of unactivated Akt
x17: percentage of activated Akt
x18: percentage of unactivated PKC-ζ
x19: percentage of activated PKC-ζ
x20: percentage of intracellular GLUT4
x21: percentage of cell-surface GLUT4

86 Mathematical Formulation. [The model is a system of coupled ordinary differential equations, dx1/dt, …, dx21/dt, with mass-action terms for insulin binding, receptor phosphorylation, internalization, and recycling; the phosphatase concentrations [PTP], [PTEN], and [SHIP] multiply the corresponding dephosphorylation terms.]
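The model's equations are of mass-action form, so a single step can illustrate how such a system is integrated numerically. Below, one binding reaction, x2 + insulin ⇌ x3, is advanced with a forward-Euler scheme; the rate constants and concentrations are hypothetical illustrative values, not the published parameters.

```python
def euler_binding(ins=1e-9, k1=6e7, k_minus1=0.2,
                  x2_0=9e-13, dt=0.001, t_end=10.0):
    """Forward-Euler integration of a single mass-action binding step:
    d(x3)/dt = k1*ins*x2 - k_minus1*x3   (illustrative rate constants).
    Returns the final (x2, x3) concentrations."""
    x2, x3 = x2_0, 0.0
    for _ in range(int(t_end / dt)):
        dx3 = k1 * ins * x2 - k_minus1 * x3   # net binding rate
        x2 -= dx3 * dt                        # free receptors consumed
        x3 += dx3 * dt                        # bound receptors produced
    return x2, x3
```

Because each Euler step moves the same amount out of x2 and into x3, total receptor concentration is conserved, and x3 relaxes monotonically toward its equilibrium value x3_eq = (k1·ins / (k1·ins + k_minus1)) · x2_0.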


More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

More information

Analyzing ICAT Data. Analyzing ICAT Data

Analyzing ICAT Data. Analyzing ICAT Data Analyzing ICAT Data Gary Van Domselaar University of Alberta Analyzing ICAT Data ICAT: Isotope Coded Affinity Tag Introduced in 1999 by Ruedi Aebersold as a method for quantitative analysis of complex

More information

Microarray data analysis

Microarray data analysis Microarray data analysis Computational Biology IST Technical University of Lisbon Ana Teresa Freitas 016/017 Microarrays Rows represent genes Columns represent samples Many problems may be solved using

More information

Micro-array Image Analysis using Clustering Methods

Micro-array Image Analysis using Clustering Methods Micro-array Image Analysis using Clustering Methods Mrs Rekha A Kulkarni PICT PUNE kulkarni_rekha@hotmail.com Abstract Micro-array imaging is an emerging technology and several experimental procedures

More information

Mixture models and clustering

Mixture models and clustering 1 Lecture topics: Miture models and clustering, k-means Distance and clustering Miture models and clustering We have so far used miture models as fleible ays of constructing probability models for prediction

More information

A novel firing rule for training Kohonen selforganising

A novel firing rule for training Kohonen selforganising A novel firing rule for training Kohonen selforganising maps D. T. Pham & A. B. Chan Manufacturing Engineering Centre, School of Engineering, University of Wales Cardiff, P.O. Box 688, Queen's Buildings,

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray

More information

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand

More information

Data Mining and Analytics. Introduction

Data Mining and Analytics. Introduction Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data

More information

Cluster Analysis: Agglomerate Hierarchical Clustering

Cluster Analysis: Agglomerate Hierarchical Clustering Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical

More information

Advanced visualization techniques for Self-Organizing Maps with graph-based methods

Advanced visualization techniques for Self-Organizing Maps with graph-based methods Advanced visualization techniques for Self-Organizing Maps with graph-based methods Georg Pölzlbauer 1, Andreas Rauber 1, and Michael Dittenbach 2 1 Department of Software Technology Vienna University

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Preprocessing -- examples in microarrays

Preprocessing -- examples in microarrays Preprocessing -- examples in microarrays I: cdna arrays Image processing Addressing (gridding) Segmentation (classify a pixel as foreground or background) Intensity extraction (summary statistic) Normalization

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

Lecture Topic Projects

Lecture Topic Projects Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, basic tasks, data types 3 Introduction to D3, basic vis techniques for non-spatial data Project #1 out 4 Data

More information

How and what do we see? Segmentation and Grouping. Fundamental Problems. Polyhedral objects. Reducing the combinatorics of pose estimation

How and what do we see? Segmentation and Grouping. Fundamental Problems. Polyhedral objects. Reducing the combinatorics of pose estimation Segmentation and Grouping Fundamental Problems ' Focus of attention, or grouping ' What subsets of piels do we consider as possible objects? ' All connected subsets? ' Representation ' How do we model

More information

Clustering. Supervised vs. Unsupervised Learning

Clustering. Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Segmentation Computer Vision Spring 2018, Lecture 27

Segmentation Computer Vision Spring 2018, Lecture 27 Segmentation http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 218, Lecture 27 Course announcements Homework 7 is due on Sunday 6 th. - Any questions about homework 7? - How many of you have

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Learning without Class Labels (or correct outputs) Density Estimation Learn P(X) given training data for X Clustering Partition data into clusters Dimensionality Reduction Discover

More information

Contents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results

Contents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results Statistical Analysis of Microarray Data Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation of clustering results Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be

More information

Hierarchical Clustering 4/5/17

Hierarchical Clustering 4/5/17 Hierarchical Clustering 4/5/17 Hypothesis Space Continuous inputs Output is a binary tree with data points as leaves. Useful for explaining the training data. Not useful for making new predictions. Direction

More information

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other

More information

Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data

Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data Ryan Atallah, John Ryan, David Aeschlimann December 14, 2013 Abstract In this project, we study the problem of classifying

More information

Clustering Jacques van Helden

Clustering Jacques van Helden Statistical Analysis of Microarray Data Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation

More information

Clustering analysis of gene expression data

Clustering analysis of gene expression data Clustering analysis of gene expression data Chapter 11 in Jonathan Pevsner, Bioinformatics and Functional Genomics, 3 rd edition (Chapter 9 in 2 nd edition) Human T cell expression data The matrix contains

More information

Function approximation using RBF network. 10 basis functions and 25 data points.

Function approximation using RBF network. 10 basis functions and 25 data points. 1 Function approximation using RBF network F (x j ) = m 1 w i ϕ( x j t i ) i=1 j = 1... N, m 1 = 10, N = 25 10 basis functions and 25 data points. Basis function centers are plotted with circles and data

More information

3. Multidimensional Information Visualization II Concepts for visualizing univariate to hypervariate data

3. Multidimensional Information Visualization II Concepts for visualizing univariate to hypervariate data 3. Multidimensional Information Visualization II Concepts for visualizing univariate to hypervariate data Vorlesung Informationsvisualisierung Prof. Dr. Andreas Butz, WS 2009/10 Konzept und Basis für n:

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

EECS 730 Introduction to Bioinformatics Microarray. Luke Huan Electrical Engineering and Computer Science

EECS 730 Introduction to Bioinformatics Microarray. Luke Huan Electrical Engineering and Computer Science EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/ GeneChip 2011/11/29 EECS 730 2 Hybridization to the Chip 2011/11/29

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Comparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays.

Comparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays. Comparisons and validation of statistical clustering techniques for microarray gene expression data Susmita Datta and Somnath Datta Presented by: Jenni Dietrich Assisted by: Jeffrey Kidd and Kristin Wheeler

More information

Key properties of local features

Key properties of local features Key properties of local features Locality, robust against occlusions Must be highly distinctive, a good feature should allow for correct object identification with low probability of mismatch Easy to etract

More information

Metabolomic Data Analysis with MetaboAnalyst

Metabolomic Data Analysis with MetaboAnalyst Metabolomic Data Analysis with MetaboAnalyst User ID: guest6522519400069885256 April 14, 2009 1 Data Processing and Normalization 1.1 Reading and Processing the Raw Data MetaboAnalyst accepts a variety

More information

Gene expression & Clustering (Chapter 10)

Gene expression & Clustering (Chapter 10) Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching

More information

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data

More information

A Dendrogram. Bioinformatics (Lec 17)

A Dendrogram. Bioinformatics (Lec 17) A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and

More information

Data Warehousing and Machine Learning

Data Warehousing and Machine Learning Data Warehousing and Machine Learning Preprocessing Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 DWML Spring 2008 1 / 35 Preprocessing Before you can start on the actual

More information

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Mathematical and Computational Applications, Vol. 5, No. 2, pp. 240-247, 200. Association for Scientific Research MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Volkan Uslan and Đhsan Ömür Bucak

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

Preprocessing DWML, /33

Preprocessing DWML, /33 Preprocessing DWML, 2007 1/33 Preprocessing Before you can start on the actual data mining, the data may require some preprocessing: Attributes may be redundant. Values may be missing. The data contains

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

Gene Expression an Overview of Problems & Solutions: 1&2. Utah State University Bioinformatics: Problems and Solutions Summer 2006

Gene Expression an Overview of Problems & Solutions: 1&2. Utah State University Bioinformatics: Problems and Solutions Summer 2006 Gene Expression an Overview of Problems & Solutions: 1&2 Utah State University Bioinformatics: Problems and Solutions Summer 2006 Review DNA mrna Proteins action! mrna transcript abundance ~ expression

More information

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical

More information

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 WRI C225 Lecture 04 130131 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Histogram Equalization Image Filtering Linear

More information

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

More information

Discussion: Clustering Random Curves Under Spatial Dependence

Discussion: Clustering Random Curves Under Spatial Dependence Discussion: Clustering Random Curves Under Spatial Dependence Gareth M. James, Wenguang Sun and Xinghao Qiao Abstract We discuss the advantages and disadvantages of a functional approach to clustering

More information

Clustering algorithms and autoencoders for anomaly detection

Clustering algorithms and autoencoders for anomaly detection Clustering algorithms and autoencoders for anomaly detection Alessia Saggio Lunch Seminars and Journal Clubs Université catholique de Louvain, Belgium 3rd March 2017 a Outline Introduction Clustering algorithms

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

Self-Organizing Maps for cyclic and unbounded graphs

Self-Organizing Maps for cyclic and unbounded graphs Self-Organizing Maps for cyclic and unbounded graphs M. Hagenbuchner 1, A. Sperduti 2, A.C. Tsoi 3 1- University of Wollongong, Wollongong, Australia. 2- University of Padova, Padova, Italy. 3- Hong Kong

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

The Traveling Salesman

The Traveling Salesman Neural Network Approach To Solving The Traveling Salesman Problem The Traveling Salesman The shortest route for a salesman to visit every city, without stopping at the same city twice. 1 Random Methods

More information

Supervised vs unsupervised clustering

Supervised vs unsupervised clustering Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information