Bioinformatics - Lecture 07


Bioinformatics: Clusters and networks
Martin Saturka
http://www.bioplexity.org/lectures/
EBI version 0.4
Creative Commons Attribution-Share Alike 2.5 License

Learning on profiles
Supervised and unsupervised methods on expression data:
- approximation, clustering, classification, inference
- crisp and fuzzy relation models
- graphical models
Main topics:
- vector approaches - SVM, ANN, kernels - classification
- data organization - clustering techniques - map generation
- regulatory systems - Bayesian networks - algebraic description

Inference types
Reasoning:
- logical reasoning: standard, based on logical rules
- statistical reasoning: based on frequent co-occurrence
Inference schemes:
- deduction (logic, recursion): from A and A → B conclude B
- induction (frequentist statistics): from many cases of (A, B) and few of (A, ¬B) conclude A → B
- abduction (Bayesian statistics): from A_1 → B, ..., A_n → B and B conclude some A_i

Machine learning methods
Supervised learning: methods with known correct outputs on the training data
- approximation: measured data to a (continuous) output function, e.g. growth rate from nutrition supply regulation
- classification: measured data to a discrete output function, e.g. expression profiles to illness diagnosis
- regression: (cor)relations among continuous measured data, e.g. a gene expression magnitude vs. growth rate
Unsupervised learning: methods without known desired outputs on the used data
- data granulation: internal data organization and distribution
- data visualization: an overall outer view onto the internal data

MLE - maximum likelihood estimation
L(y | X) = Pr(X | Y = y)
- conditional probability viewed as a function of the unknown condition with a known outcome, a reverse view on probability
Bernoulli trials example:
- L(θ | H = 11, T = 10) = Pr(H = 11, T = 10 | p = θ) = C(21, 11) θ^11 (1 - θ)^10
- 0 = ∂/∂θ L(θ | H, T) = C(21, 11) θ^10 (1 - θ)^9 (11 - 21θ), hence θ = 11/21
Notes:
- used when no better model is available
- maximizations inside dynamic programming techniques
- variance estimation leads to the biased sample variance
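
As an illustration (not part of the original slides), a short Python check of the Bernoulli example: the likelihood L(θ | H = 11, T = 10) is maximized over a dense grid, and the numeric maximizer agrees with the analytic value θ = 11/21.

import numpy as np
from math import comb

H, T = 11, 10                                      # observed heads and tails
theta = np.linspace(1e-6, 1 - 1e-6, 100001)        # grid over the unknown parameter
likelihood = comb(H + T, H) * theta**H * (1 - theta)**T

print("grid maximizer:", theta[np.argmax(likelihood)])   # ~0.5238
print("analytic MLE  :", H / (H + T))                    # 11/21 = 0.5238...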

Regression
Linear regression:
- least squares, for homoskedastic distributions; the sample mean is the best estimate
  min_{x ∈ R} Σ_i (x_i - x)^2,  d/dx Σ_i (x_i - x)^2 = 0  gives the arithmetic mean x = Σ_i x_i / n
- least absolute deviations, the robust version; the sample median is a safe estimate
  min_{x ∈ R} Σ_i |x_i - x|,  d/dx Σ_i |x_i - x| = 0  gives the median: #{x_i : x_i < x} = #{x_i : x_i > x}
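
A small numeric sketch (Python with numpy, added here for illustration) of the two loss functions above: with one outlier in the sample, the minimizer of the squared loss is the mean (pulled by the outlier), while the minimizer of the absolute loss is the median.

import numpy as np

x = np.array([1.0, 1.2, 0.9, 1.1, 8.0])            # sample with one outlier
candidates = np.linspace(0.0, 10.0, 10001)

sq_loss = np.array([np.sum((x - c) ** 2) for c in candidates])
abs_loss = np.array([np.sum(np.abs(x - c)) for c in candidates])

print(candidates[np.argmin(sq_loss)], x.mean())       # both ~2.44, pulled by the outlier
print(candidates[np.argmin(abs_loss)], np.median(x))  # both ~1.1, robust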

Parametrization
Parametrized curve crossing:
- assumption: equal amounts of points above / below the curve, with the same probabilities to cross / not to cross the curve between consecutive points
- the cross-count distribution approaches a normal distribution
- over/under-fitting is indicated if the cross count does not fall within (N-1)/2 ± √(N-1)/2

Distinctions
Empirical risk minimization for discrimination.
Meta-methodology:
- boosting: increasing the weights of wrongly classified training cases
- probably approximately correct (PAC) learning: to achieve high probabilities of making convenient predictions
Particular methods:
- support vector machines
- artificial neural networks
- case-based reasoning, nearest neighbour algorithm
- (naive) Bayes classifier
- decision trees, random forests

SVM - support vector machines
- linear classifier with a maximal margin of linear separation
- minimal distances for misclassified cases

Kernel methods
- non-linear separations turned into linear ones in a higher-dimensional space F(·)
- the linear discriminant is given by the dot product ⟨F(x_i), F(x_j)⟩
- computed back in the low-dimensional space by a kernel K(x_i, x_j)
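
A hedged sketch of the kernel trick, assuming scikit-learn is available (the dataset and parameters are made up for illustration): the same support vector classifier is trained with a linear and with an RBF kernel on two concentric rings; only the kernelized version separates them.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200                                            # points per ring
r = np.r_[rng.normal(1.0, 0.1, n), rng.normal(3.0, 0.1, n)]
phi = rng.uniform(0.0, 2 * np.pi, 2 * n)
X = np.c_[r * np.cos(phi), r * np.sin(phi)]        # two concentric rings in 2D
y = np.r_[np.zeros(n), np.ones(n)]

for kernel in ("linear", "rbf"):                   # rbf: K(x_i, x_j) = exp(-gamma ||x_i - x_j||^2)
    acc = SVC(kernel=kernel).fit(X, y).score(X, y)
    print(kernel, "training accuracy:", acc)       # the RBF kernel separates the rings, a line cannot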

VC dimension - Vapnik-Chervonenkis dimension
Classification error estimation: the more power a method has, the more prone it is to overfitting.
Misclassifications for a binary classifier f(α), with i.i.d. samples drawn from an unknown distribution:
- R(α): probability of misclassification in real usage
- R_emp(α): fraction of misclassified cases of a training set
Then, with probability 1 - η, for a training set of size N:
  R(α) < R_emp(α) + √[ (h (1 + ln(2N/h)) - ln(η/4)) / N ]
where h, the VC dimension, is the size of the maximal sets that f(α) can shatter (3 for a line classifier in a 2D space).

ANN - artificial neural networks
[Diagram: a feed-forward network with inputs i_A, i_B, i_C, two hidden layers (neurons 1-4 and 5-8) and a single output; w denotes the connection weights, e.g. w_A2, w_25, w_5o.]
- i_X: inputs, w: weights, one output
- neuron activation function: f_2(e) = f_2(i_A w_A2 + i_B w_B2 - c_2)
- f_i is non-linear, usually a sigmoid, with c_i given constants

ANN learning - error backpropagation
Iterative weight adjusting:
- compute the error for each training case: δ = desired - computed
- propagate the δ backward: δ_5 = δ w_5o,  δ_1 = δ_5 w_15 + δ_6 w_16, ...
- adjust the weights to new values:
  w_A1^new = w_A1 + η δ_1 (df_1(e)/de) i_A
  w_15^new = w_15 + η δ_5 (df_5(e)/de) f_1(e), ...
Notes:
- a kind of gradient descent method; other (sigmoid function) parameters can be adjusted as well
- converges to a local minimum of the errors
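
A minimal numpy sketch of the backpropagation updates above for a single hidden sigmoid layer; the topology, target function and learning rate are illustrative choices, not the lecture's exact network.

import numpy as np

def sigmoid(e):
    return 1.0 / (1.0 + np.exp(-e))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                         # inputs i_A, i_B, i_C
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]    # some non-linear target

W1 = rng.normal(scale=0.5, size=(3, 8))               # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))               # hidden -> output weights
eta = 2.0

for epoch in range(5000):
    h = sigmoid(X @ W1)                               # hidden activations f_i(e)
    out = sigmoid(h @ W2)                             # network output
    delta_out = (y - out) * out * (1 - out)           # delta = desired - computed, times df/de
    delta_hid = (delta_out @ W2.T) * h * (1 - h)      # propagate the delta backward
    W2 += eta * h.T @ delta_out / len(X)              # w_new = w + eta * delta * df/de * input
    W1 += eta * X.T @ delta_hid / len(X)

print("fraction of training cases classified correctly:", ((out > 0.5) == y).mean())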

SOM - self-organizing maps
A kind of unsupervised counterpart of ANNs:
- the map is commonly a 2D array; the array nodes exhibit a simple property
- each input is connected to each output
- used to visualize multidimensional data; similar parts should behave similarly
Competitive learning of the network:
- nodes compete to represent particular data objects
- each node of the array has its vector of weights, initialized either at random or from the two principal components
Iterative adjusting of node weights / properties:
- take a random data object
- find its best matching node according to the node weights
- adjust the node weights / property to be more similar to the data
- adjust the neighboring nodes somewhat too
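
A compact sketch of the competitive-learning loop just described, using only numpy; the grid size, learning-rate and neighborhood schedules are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 4))              # e.g. 4-dimensional expression profiles

grid = 6                                      # a 6 x 6 map
W = rng.normal(size=(grid, grid, 4))          # one weight vector per map node
gy, gx = np.mgrid[0:grid, 0:grid]

for t in range(3000):
    lr = 0.5 * (1 - t / 3000)                 # shrinking learning rate
    radius = 3.0 * (1 - t / 3000) + 0.5       # shrinking neighborhood
    x = data[rng.integers(len(data))]         # take a random data object
    d = np.linalg.norm(W - x, axis=2)         # distance of x to every node
    by, bx = np.unravel_index(np.argmin(d), d.shape)   # best matching node
    grid_dist2 = (gy - by) ** 2 + (gx - bx) ** 2
    h = np.exp(-grid_dist2 / (2 * radius ** 2))        # neighborhood function on the grid
    W += lr * h[:, :, None] * (x - W)         # move the winner and its neighbors toward x

print(W.shape)                                # trained 6 x 6 map of prototype vectors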

GTM - generative topographic map
GTM characteristics:
- a non-linear latent variable model, the probabilistic counterpart to the SOM model
- a generative model: the actual data are modeled as created by mappings from a low-dimensional space into the actual high-dimensional space
- data visualization is gained according to Bayes' theorem
- the latent-to-data space mappings are Gaussian distributions
- the created densities are iteratively fitted to approximate the real data distribution
- uses the known Gaussian mixture and radial basis function algorithms

Nearest neighbor - case-based reasoning
Classification: the diagnosis is set to that of the most similar determined case.
- how to measure distances between particular cases?
k-NN: take the k most similar cases, each of them has a vote.
- simple, but frequently works for (binary) classification
Common problems:
- which properties are significant and which are just noise
- suitable sizes of similar cases, how to avoid outliers
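
A tiny k-NN vote in Python (illustrative data, Euclidean distance assumed); the choice of distance, of k and of the relevant features are exactly the open problems listed above.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training cases."""
    d = np.linalg.norm(X_train - x, axis=1)        # case-to-case Euclidean distances
    nearest = np.argsort(d)[:k]                    # indices of the k most similar cases
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array(["healthy", "healthy", "ill", "ill"])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))   # -> "ill"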

Relations
The right descriptive features - the right similar cases: search for important gene expressions and patient cases.
Unsupervised methods:
- data clustering for the similar data cases
- data mining for the important features
Supervised methods:
- Bayesian network inference - informatics and statistics
- minimum message length - informatics and algebra
- inductive logic programming - informatics and logic

Cliques - a graph approach to clustering
Transform the given data table into a graph:
- vertices for genes, edges for gene pairs with similarity above a threshold
- find the least graph alteration that results in a clique graph
CAST algorithm - iterative heuristic clique generation:
- construct a clique from the available vertices
- initiate with a vertex of maximal degree
- while a distant vertex is taken or a close vertex is free:
  - add the closest vertex into the clique
  - remove the farthest vertex from the clique
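
A rough, hedged sketch of a CAST-style add/remove loop, assuming a precomputed similarity matrix and a fixed affinity threshold t; the exact acceptance rule of the original CAST algorithm may differ from this toy version.

import numpy as np

def cast_like(S, t):
    """Greedy CAST-style cluster construction on similarity matrix S, threshold t."""
    n = len(S)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # initiate with the vertex of maximal total similarity among the free ones
        seed = max(unassigned, key=lambda i: sum(S[i][j] for j in unassigned))
        C = {seed}
        for _ in range(2 * n):                    # bounded add/remove loop
            changed = False
            rest = unassigned - C
            if rest:
                # add the closest free vertex if its mean affinity to C is high enough
                best = max(rest, key=lambda i: S[i, list(C)].mean())
                if S[best, list(C)].mean() >= t:
                    C.add(best)
                    changed = True
            # remove the farthest cluster member if its mean affinity dropped below t
            worst = min(C, key=lambda i: S[i, list(C)].mean())
            if len(C) > 1 and S[worst, list(C)].mean() < t:
                C.discard(worst)
                changed = True
            if not changed:
                break
        clusters.append(C)
        unassigned -= C
    return clusters

S = np.array([[1.0, 0.9, 0.8, 0.1],
              [0.9, 1.0, 0.7, 0.2],
              [0.8, 0.7, 1.0, 0.1],
              [0.1, 0.2, 0.1, 1.0]])
print(cast_like(S, t=0.5))                        # e.g. [{0, 1, 2}, {3}]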

Clustering
Standard clustering techniques to make separated homogeneous groups:
- center based methods: k-means as the standard, c-means, QT clustering
- hierarchy methods: agglomerative (bottom-up), divisive (top-down)
- combinations: two-step approaches

Cluster structures
[Figure: comparison of cluster structures produced by complete-linkage and single-linkage hierarchical clustering, neighbor joining, k-means clustering and QT clustering.]

Distances - how to measure object-object (dis)similarity
- Euclidean distance: [Σ_i (x_i - y_i)^2]^(1/2)
- Manhattan distance: Σ_i |x_i - y_i|
- power distance: [Σ_i |x_i - y_i|^p]^(1/p)
- maximum distance: max_i |x_i - y_i|
- Pearson's correlation: a dot product for normalized data
- percentage disagreement: the fraction of coordinates with x_i ≠ y_i
Metric significance:
- different powers usually do not significantly alter the results
- more different distance measures should change the cluster compositions
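
The distance measures above written out in numpy (purely illustrative code; Pearson's correlation is turned into a dissimilarity as 1 - r).

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def power(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def maximum(x, y):
    return np.max(np.abs(x - y))

def pearson_distance(x, y):
    # 1 - correlation; a dot product of the normalized profiles
    xn = (x - x.mean()) / x.std()
    yn = (y - y.mean()) / y.std()
    return 1.0 - np.mean(xn * yn)

def disagreement(x, y):
    # fraction of coordinates where the two profiles differ
    return np.mean(x != y)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.5, 2.0, 5.0])
print(euclidean(x, y), manhattan(x, y), maximum(x, y), pearson_distance(x, y))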

Hierarchical clustering
Neighbor joining - the common agglomerative method: hierarchical tree creation by joining the most similar clusters.
- single linkage - nearest neighbor: distances according to the most similar cluster objects
- complete linkage - furthest neighbor: distances according to the most distant cluster objects
- average linkage: cluster distances as mean distances of the respective elements
- Ward's method - information loss minimization: takes the minimal variance increase over the possible cluster pairs
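
The linkage variants above run through SciPy, assuming scipy is installed; each method only changes how cluster-to-cluster distances are measured while the agglomerative tree is built.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                      # agglomerative tree of nested clusters
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes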

The k-means
The most frequently used kind of clustering.
k-means clustering algorithm:
- start: choose initial k centers
- iterate over the objects (e.g. genes) being clustered:
  - compute the new distances to the centers, choose the nearest one
  - for each cluster compute a new center
- end when no cluster changes
- can be used to put less weight on similar microarrays
Pros: usually fast, does not compute all object-object distances.
Cons: the number and initial positions of the centers highly affect the results.
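
A bare-bones numpy version of the loop above; the initial centers are drawn at random from the data, which, as the "cons" point notes, strongly affects the result.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]       # initial k centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each object to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # recompute each center as the mean of its cluster (keep it if the cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                    # no cluster changes
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
labels, centers = kmeans(X, k=3)
print(centers)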

Center count
Do not add more clusters when it does not increase the gained information sufficiently.
Center selection:
- random: run k-means clustering several times
- PCA: principal components lie in the data clouds
- data objects: choose distant objects, with weights
- two steps: take a larger number of clusters, then do hierarchical clustering on the resulting centers

Cluster fit - how convenient are the gained clusters
k-means clustering:
- intra-cluster vs. out-of-cluster distances for the objects
- ratios of inter-cluster to intra-cluster distances
- the Dunn's index: inter / intra variances
Hierarchical clustering:
- variances of each cluster, similarity of the cluster means
- bootstrapping for objects with suitable inner structures

Alternative clustering
QT (quality threshold) clustering:
- choose a maximal cluster diameter instead of a center count
- try to make a maximal cluster around each data object, take the one with the greatest number of objects inside
- call it recursively on the rest of the data
- more computation intensive, but more plausible than k-means; can be done alike for maximal cluster sizes
Soft c-means:
- each object (gene) is in more clusters (gene families), with degrees of belonging that sum to one
- similar to k-means, stop when the cluster changes are small; suitable for lower numbers of clusters
Spectral clustering:
- object segmentation according to similarity
- uses the Laplacian matrix eigenvector of the second smallest eigenvalue
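
A toy sketch of the quality-threshold idea as described above (numpy, quadratic work per round): grow a candidate cluster around every remaining object subject to a maximum diameter, keep the most populated candidate, and recurse on the rest.

import numpy as np

def qt_cluster(X, max_diameter):
    idx = list(range(len(X)))
    clusters = []
    while idx:
        best = []
        for seed in idx:                          # try a maximal cluster around each object
            cand = [seed]
            for j in idx:
                if j == seed:
                    continue
                # add j only if the cluster diameter stays within the threshold
                if all(np.linalg.norm(X[j] - X[m]) <= max_diameter for m in cand):
                    cand.append(j)
            if len(cand) > len(best):
                best = cand
        clusters.append(best)                     # keep the most populated candidate
        idx = [i for i in idx if i not in best]   # recurse on the rest of the data
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
print(qt_cluster(X, max_diameter=1.0))            # -> [[0, 1, 2], [3, 4]]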

Dependency description
Used for characterization, classification and compression.
- BN - Bayesian networks: what depends on what, which variables are independent; allows fast, suitable inference computing
- MML - minimum message length: the shortest form of an object / feature description, the real amount of information an object contains
- ILP - inductive logic programming: to prove most of the positive / least of the negative cases, an accurate characterization of the given objects vs. the background

Graphical models
Joint distributions:
- Pr(x_1, x_2, x_3) = Pr(x_1 | x_2, x_3) Pr(x_2 | x_3) Pr(x_3)
- intractable for slightly larger numbers of variables
- used to compute important probabilities themselves, and conditional probabilities for Bayesian inference
Conditional independence - A and B independent given C, written A ⊥ B | C:
- A ⊥ B | C: Pr(A, B | C) = Pr(A | C) Pr(B | C)
- after a few rearrangements: Pr(A | B, C) = Pr(A | C)
- Markov processes are just one example of conditional independence
Graphs and relations:
- simplifying the structure by stating just some valid dependencies: no edges mean (conditional) independence, edges state the only dependencies

Bayesian systems
[Figure: a DAG of dependencies over nodes x_1, ..., x_9; x_5 depends on x_1 and x_2.]
- Pr(x_5 | x_1, x_2, x_3, x_4) = Pr(x_5 | x_1, x_2)
- any node is - given all its parents - (conditionally) independent of all the nodes which are not its descendants (e.g. x_3 and x_5 are independent outright)

Bayesian network inference
- inference along the DAG is computed according to the decomposition of conditional probabilities
- inference along the opposite directions is computed according to the Bayesian approach
Example with variables A, B, C:
- Pr(A, B, C) = Pr(A | B, C) Pr(B | C) Pr(C)
- how to compute Pr(A = True | C = True):
  Pr(A = True | C = True) = Pr(C = True, A = True) / Pr(C = True)
                          = Σ_B Pr(C = True, B, A = True) / Σ_{A,B} Pr(C = True, B, A)
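
A toy numeric version of the computation above with made-up conditional probability tables (A depending on B, B on C): Pr(A = True | C = True) is obtained by summing the joint over B and normalizing, exactly the marginalization written on the slide.

from itertools import product

# made-up conditional tables for the decomposition on the slide:
# Pr(A, B, C) = Pr(A | B, C) * Pr(B | C) * Pr(C), here with A depending on B only
P_C = {True: 0.3, False: 0.7}
P_B_given_C = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
P_A_given_B = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    return P_A_given_B[b][a] * P_B_given_C[c][b] * P_C[c]

# Pr(A=True | C=True) = sum_B Pr(C=True, B, A=True) / sum_{A,B} Pr(C=True, B, A)
num = sum(joint(True, b, True) for b in (True, False))
den = sum(joint(a, b, True) for a, b in product((True, False), repeat=2))
print("Pr(A=True | C=True) =", num / den)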

Naive Bayes
[Figure: the naive Bayes model - a class node S (the mail/spam state) with the individual word nodes w_1, w_2, ..., w_n as its children.]
- assumption of independent outcomes; used e.g. for spam classification
- S = argmax_s Pr(S = s) Π_j Pr(O_j = w_j | S = s)
- possible to compute fast online, even with 10^4 items
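
A minimal spam-flavoured sketch of the argmax above with hand-made word counts (all numbers invented for illustration); log-probabilities and add-one smoothing are used for numerical sanity.

import math

# toy word counts per class (assumed data, for illustration only)
counts = {
    "spam": {"free": 30, "money": 25, "meeting": 2, "report": 1},
    "ham":  {"free": 3,  "money": 2,  "meeting": 20, "report": 25},
}
prior = {"spam": 0.4, "ham": 0.6}
vocab = {w for c in counts.values() for w in c}

def log_posterior(words, label):
    total = sum(counts[label].values())
    lp = math.log(prior[label])
    for w in words:
        # naive independence assumption: multiply per-word likelihoods (add-one smoothing)
        lp += math.log((counts[label].get(w, 0) + 1) / (total + len(vocab)))
    return lp

mail = ["free", "money", "meeting"]
print(max(prior, key=lambda s: log_posterior(mail, s)))   # -> "spam"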

Items to remember
Nota bene:
- classification methods: learning techniques, vectors, kernels, SVM, ANN
- conditional independence
- clustering methods: hierarchical clustering, k-means, QT clustering