Analysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths

Similar documents
MCL. (and other clustering algorithms) 858L

Brief description of the base clustering algorithms

Identifying network modules

CS200: Graphs. Prichard Ch. 14 Rosen Ch. 10. CS200 - Graphs 1

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

EECS730: Introduction to Bioinformatics

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Mining Web Data. Lijun Zhang

CS224W: Analysis of Networks Jure Leskovec, Stanford University

Graphs. Tessema M. Mengistu Department of Computer Science Southern Illinois University Carbondale Room - Faner 3131

COMP Page Rank

CS6200 Information Retreival. The WebGraph. July 13, 2015

Chapter 14. Graphs Pearson Addison-Wesley. All rights reserved 14 A-1

Mining Web Data. Lijun Zhang

Degree Distribution: The case of Citation Networks

V 1 Introduction! Mon, Oct 15, 2012! Bioinformatics 3 Volkhard Helms!

PPI Network Alignment Advanced Topics in Computa8onal Genomics

Extracting Information from Complex Networks

Unsupervised Learning : Clustering

CS 231A CA Session: Problem Set 4 Review. Kevin Chen May 13, 2016

Oh Pott, Oh Pott! or how to detect community structure in complex networks

L22-23: Graph Algorithms

Lesson 4. Random graphs. Sergio Barbarossa. UPC - Barcelona - July 2008

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

CS200: Graphs. Rosen Ch , 9.6, Walls and Mirrors Ch. 14

Social-Network Graphs

Introduction to Mathematical Programming IE406. Lecture 16. Dr. Ted Ralphs

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms

Collaborative filtering based on a random walk model on a graph

Local Algorithms for Sparse Spanning Graphs

A Tutorial on the Probabilistic Framework for Centrality Analysis, Community Detection and Clustering

Introduction to Graph Theory

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

Sampling Large Graphs: Algorithms and Applications

Lecture 17 November 7

Social Network Analysis

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Course Introduction / Review of Fundamentals of Graph Theory

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis

Community Detection. Community

Clustering CS 550: Machine Learning

Finding and Visualizing Graph Clusters Using PageRank Optimization. Fan Chung and Alexander Tsiatas, UCSD WAW 2010

Algorithms and Applications in Social Networks. 2017/2018, Semester B Slava Novgorodov

Bipartite Perfect Matching in O(n log n) Randomized Time. Nikhil Bhargava and Elliot Marx

Graph and Digraph Glossary

Social Networks. Slides by : I. Koutsopoulos (AUEB), Source:L. Adamic, SN Analysis, Coursera course

Basics of Network Analysis

University of Florida CISE department Gator Engineering. Clustering Part 2

The ADT priority queue Orders its items by a priority value The first item removed is the one having the highest priority value

Graph Theory and Network Measurment

CS-E5740. Complex Networks. Network analysis: key measures and characteristics

G(B)enchmark GraphBench: Towards a Universal Graph Benchmark. Khaled Ammar M. Tamer Özsu

Pregel. Ali Shah

CSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection

Lecture 3, Review of Algorithms. What is Algorithm?

Classification and Regression

Graph Theory. Graph Theory. COURSE: Introduction to Biological Networks. Euler s Solution LECTURE 1: INTRODUCTION TO NETWORKS.

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Some Graph Theory for Network Analysis. CS 249B: Science of Networks Week 01: Thursday, 01/31/08 Daniel Bilar Wellesley College Spring 2008

Centralities for undirected graphs

Online Social Networks and Media. Community detection

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Information Retrieval and Web Search

Networks in economics and finance. Lecture 1 - Measuring networks

CS 534: Computer Vision Segmentation II Graph Cuts and Image Segmentation

Lecture 1: Introduction and Motivation Markus Kr otzsch Knowledge-Based Systems

Database and Knowledge-Base Systems: Data Mining. Martin Ester

1 Homophily and assortative mixing

Gene expression & Clustering (Chapter 10)

Information Networks: PageRank

GRAPHS, GRAPH MODELS, GRAPH TERMINOLOGY, AND SPECIAL TYPES OF GRAPHS

Diffusion and Clustering on Large Graphs

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

Notes for Lecture 24

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Social Network Analysis

Graph Algorithms. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.

Cluster analysis. Agnieszka Nowak - Brzezinska

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

SEMINAR: GRAPH-BASED METHODS FOR NLP

KNOWLEDGE GRAPHS. Lecture 1: Introduction and Motivation. TU Dresden, 16th Oct Markus Krötzsch Knowledge-Based Systems

THE KNOWLEDGE MANAGEMENT STRATEGY IN ORGANIZATIONS. Summer semester, 2016/2017

Biological Networks Analysis

MIDTERM EXAMINATION Networked Life (NETS 112) November 21, 2013 Prof. Michael Kearns

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Link Analysis and Web Search

Cluster Analysis: Basic Concepts and Algorithms

CSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection

Single link clustering: 11/7: Lecture 18. Clustering Heuristics 1

CS473-Algorithms I. Lecture 13-A. Graphs. Cevdet Aykanat - Bilkent University Computer Engineering Department

CENTRALITIES. Carlo PICCARDI. DEIB - Department of Electronics, Information and Bioengineering Politecnico di Milano, Italy

Exploiting Parallelism to Support Scalable Hierarchical Clustering

6. Object Identification L AK S H M O U. E D U

CSCI5070 Advanced Topics in Social Computing

.. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Clustering analysis of gene expression data

Clustering in Data Mining

V10 Metabolic networks - Graph connectivity

Transcription:

Analysis of Biological Networks 1. Clustering 2. Random Walks 3. Finding paths

Problem 1: Graph Clustering Finding dense subgraphs Applications Identification of novel pathways, complexes, other modules? Example algorithm: MCODE

The Problem Given a protein interaction network find strongly connected components (clusters) with the network that may correspond to biological functional modules (complexes or pathways)

Some Algorithms MCL Markov CLustering RNSC Restricted Neighborhood Search Clustering SPC Super Paramagnetic Clustering MCODE Molecular COmplex DEtection

Markov Cluster Algorithm Simulates a flow on the graph. Calculates successive powers of the adjacency matrix Parameters One parameter: inflation parameter The process partitions the graph (i.e., no overlapping clusters) The inflation parameter influence the number of clusters generated

Restricted Neighborhood Search Clustering Starts with an initial random clustering Tries to minimize a cost function by iteratively moving vertices between neighboring clusters. Parameters: Number of iterations Diversification frequency. and 5 other parameters

Super Paramagnetic Clustering Hierarchical algorithm inspired from an analogy with the physical properties of a ferromagnetic model subject to fluctuation at nonzero temperature. Parameters: Number of nearest neighbors Temperature

MCODE Weight each vertex by its local neighborhood density (using a modified version of clustering coefficient) Starting from the top weighted vertex, include neighborhood vertices with similar weights to the cluster Remove the vertices from the clusters Continue with the next highest weight vertex in the network May provide overlapping clusters

Core-clustering Coefficient Product of the clustering coefficient of the highest k-core in the neighborhood of a vertex and k.

Problem 2: Finding relationships Random Walks on Graphs Finding important nodes (Google s PageRank) Function prediction Adding new members to known pathways, complexes Finding relationships of genes/diseases in gene-disease networks

Google s PageRank Assumption: A link from page A to page B is a recommendation of page B by the author of A (we say B is successor of A) Quality of a page is related to its in-degree Recursion: Quality of a page is related to its in-degree, and to the quality of pages linking to it PageRank [BP 98]

Definition of PageRank Consider the following infinite random walk (surf): Initially the surfer is at a random page At each step, the surfer proceeds to a randomly chosen web page with probability d to a randomly chosen successor of the current page with probability 1-d The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.

Random walks with restarts on interaction networks Consider a random walker that starts on a source node, s. At every time tick, the walker chooses randomly among the available edges (based on edge weights), or goes back to node s with probability c. 0.2 s 0.4 0.1 0.2 0.4 0.1 0.6 0.3

Random walks on graphs ( t) The probability p s ( v), is defined as the probability of finding the random walker at node v at time t. The steady state probability (v) gives a measure of affinity to node s, and can be computed efficiently using iterative matrix operations. p s

Computing the steady state p vector Let s be the vector that represents the source nodes (i.e., s i =1/n if node i is one the n source nodes, and 0 otherwise). Compute the following until p converges: p = (1-c)A T p + cs where A is the row normalized adjacency matrix and c is the restart probability.

Same example Start nodes: p 1 and p 2

Random walk results Restart probability, c = 0.3

Problem 3: Finding paths Find the best simple path of length k starting from a given node in the graph Applications The biological network is probabilistic (e.g. predicted network) Signaling pathways of known size

Problem definition Given a set I of start vertices, what is the best simple path of length k? Simple path: each vertex is visited once, no cycles Best simple path That is the most probable path i.e., if edge weights show probabilities, the probability of a path can be computed by: w( e i ) for every edgee path

Additive edge weights For an easier formulation of the problem (similar to shortest paths) it is better to work with additive edge weights rather than multiplicative ones. So convert each edge probability to: new_weight (e) = -log weight(e) probabilities between 0 and 1 new weights positive values between 0 and Infinity smaller probabilities will have larger weights and higher probabilities will have smaller weights best path is the shortest path

Formal definition Weight of a path is the sum of the weights of its edges, and the length of a path is the number of vertices it contains. Given an undirected weighted graph G=(V,E,w) with V =n, E =m and a set I of start vertices, we wish to find, for each vertex v, a minimum-weight simple path of length k that starts with I and ends at v. If no such simple path exists, the algorithm should report this fact. Simple-path restriction makes the problem a difficult one. without simple path restriction we can get the a shortest path of desired length by looping at smallest edges back and forth.

Dynamic programming The best simple-path of length k problem can be solved by dynamic programming. Define W(v, S) as the minimum weight of a simple path of length S which starts at some vertex in I, visits each vertex in S, and ends at v. Starting at smaller sets we can use the following recurrence function to fill in a table of W(v, S) for all v and S. Complexity: O(kn k )

Color coding Idea: Instead of using vertex ids (resulting in n k possible subsets of length k), let s assign random colors (out of k possible colors) to the vertices. Instead of searching for paths with distinct vertices, search for paths with distinct colors (colorful paths) This reduces the possible sets to look for to (2 k )

Color-Coding Colorful paths can be found with dynamic programming Key point: a colorful path of length k contains a colorful path of length k-1. Store path information at each node for each subset of k colors Only 2 k color subsets, rather than O(n k ) node subsets Runtime is O(2 k km) << O(kn k ) brute force Space is O(2 k n) << O(kn k ) brute force

Coloring Example A B C A B C D E D E F F I G H II G H Two different colorings on toy graph, k=3 In coloring I, W(A,RGB) is built C->BC->ABC In coloring II, W(A,RGB) is built G->BG->ABG ABC is not colorful in coloring II