University of Wisconsin-Madison Spring 2018 BMI/CS 776: Advanced Bioinformatics Homework #2


Assignment goals

- Use mutual information to reconstruct gene expression networks
- Evaluate classifier predictions
- Examine Gibbs sampling for a Markov random field
- Control for multiple hypothesis testing with q-values

Instructions

To submit your assignment, log in to the biostat server mi1.biostat.wisc.edu or mi2.biostat.wisc.edu using your biostat username and password. Copy all relevant files to the directory /u/medinfo/handin/bmi776/hw2/<username>, where <username> is your biostat username. Submit all of your Python source code and test that it runs on mi1.biostat.wisc.edu or mi2.biostat.wisc.edu without error. Do not test your code on adhara.biostat.wisc.edu.

For the rest of the assignment, compile all of your answers in a single file and submit it as solution.pdf. Write the number of late days you used at the top of solution.pdf. For the written portions of the assignment, show your work for partial credit.

Part 1: Mutual information in regulatory networks

In class, we saw how FIRE uses mutual information to detect relationships between sequence motifs and gene expression levels. Mutual information is also a popular technique for reconstructing transcriptional regulatory networks from gene expression datasets [1]. After measuring gene expression levels for all genes in a sufficient number of biological conditions, mutual information can detect some types of pairwise dependencies that may suggest one gene is a regulator (transcription factor) and another is its target. Fluctuations in the regulator's expression can influence the expression levels of the target gene. We can create an undirected gene-gene network by computing mutual information for all pairs of genes. Thresholding the mutual information produces a set of gene-gene edges.

[1] http://link.springer.com/article/10.1186%2f1471-2105-7-s1-s7
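For reference, the pairwise mutual information used throughout Part 1 is the standard discrete definition, where a and b range over the expression bins of a gene pair G1 and G2 (use whichever logarithm base was adopted in class and HW0):

MI(G_1, G_2) = \sum_{a} \sum_{b} P(G_1 = a, G_2 = b) \log \frac{P(G_1 = a, G_2 = b)}{P(G_1 = a)\, P(G_2 = b)}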

For this assignment, we will use the DREAM3 [2] network inference challenge dataset. The file data.txt has a 21 time point simulation of the gene expression levels for ten genes over four simulated replicates. The first line of the file provides the column labels. The first column is the time point, which you will not need. The other columns are the expression levels of the gene named in the first line, where genes are represented with a numeric index. The file is in a tab-delimited format.

1A: Mutual information via discrete binning

Complete the program CalcMI.py using the provided template. It takes as input the expression data for a set of genes over a number of conditions and reconstructs pairwise gene-gene dependencies using mutual information. The program will output the list of gene-gene dependencies and their mutual information. It should only consider the dependencies between pairs of distinct genes, not the mutual information of a gene with itself (the entropy of that gene's expression).

Compute mutual information by discretizing the gene expression values, mapping continuous gene expression into discrete bins, as in HW0. For a pair of genes, for example G1 and G2, construct a count matrix that tracks the number of times G1's expression is in some bin a and G2's expression is in some bin b. Add a pseudocount of 0.1 to all entries in the G1:G2 count matrix. From this count matrix, you can compute the terms P(G1 = a), P(G2 = b), and P(G1 = a, G2 = b) needed to calculate mutual information.

You can run your program from the command line as follows:

CalcMI.py --dataset=<dataset> --bin_num=<bins> --out=<out>

where

<dataset> is the name of the text file that contains the gene expression data formatted according to the description above.

<bin_num> is an integer that represents the number of bins that should be used to discretize the continuous gene expression data when calculating the mutual information. In this part, you should use a uniform binning of the gene expression range. For example, if a gene has expression values in the range [1, 11] and bin_num = 4, the bins would be [1, 3.5), [3.5, 6), [6, 8.5), and [8.5, 11]. Later, you will consider an alternative strategy, so make your code flexible.

<out> is the name of the text file into which the program will print all unique gene pairs and their mutual information values in decreasing order. After rounding mutual information to three decimal places, break ties based on the index of the first gene and then the index of the second gene if needed, sorting gene indexes in ascending order [3]. Each line in the file should contain the undirected gene-gene edge and its mutual information separated by a tab (\t). The file should be formatted as follows:

(6,9)	0.817
(3,10)	0.633
...	...
(5,8)	0.480
(3,7)	0.463
...	...
(8,9)	0.427

[2] http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0009202

1B: Implement a different binning strategy

In this part, you will try a different binning strategy for the gene expression data. Add a new argument, str, to implement the equal density binning strategy. This uses a percentile-based assignment to assign expression values to bins, as in HW0. For example, with bin_num = 2 and str = density, the lowest 50% of a gene's expression values would be mapped to bin 0 and the highest 50% would be mapped to bin 1.

You can run your program from the command line as follows:

CalcMI.py --dataset=<dataset> --bin_num=<bins> --str=density --out=<out>

[3] See https://docs.python.org/3/howto/sorting.html for sorting tips
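The following is a minimal sketch of the binning and count-matrix computation described in 1A and 1B. It is not the provided CalcMI.py template; the function names (uniform_bins, density_bins, mutual_info) are illustrative, it assumes the expression values of a gene pair are given as NumPy arrays, and it uses log base 2 (switch to whatever base matches the HW0/class convention).

import numpy as np

def uniform_bins(values, bin_num):
    # Split the range [min, max] of a gene's expression into bin_num equal-width bins
    # and return the bin index of each value; the last bin is closed on the right.
    edges = np.linspace(values.min(), values.max(), bin_num + 1)
    return np.digitize(values, edges[1:-1], right=False)

def density_bins(values, bin_num):
    # Equal-density (percentile-based) binning: roughly the same number of values per bin.
    # Ties are broken arbitrarily here; follow the HW0 convention for ties.
    ranks = values.argsort().argsort()          # 0-based rank of each value
    return (ranks * bin_num) // len(values)     # map ranks to bins 0 .. bin_num - 1

def mutual_info(x_bins, y_bins, bin_num, pseudocount=0.1):
    # Joint count matrix with a pseudocount of 0.1 added to every entry, as in 1A.
    counts = np.full((bin_num, bin_num), pseudocount)
    for a, b in zip(x_bins, y_bins):
        counts[a, b] += 1
    joint = counts / counts.sum()               # P(G1 = a, G2 = b)
    px = joint.sum(axis=1, keepdims=True)       # P(G1 = a)
    py = joint.sum(axis=0, keepdims=True)       # P(G2 = b)
    return float(np.sum(joint * np.log2(joint / (px * py))))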

1C: Mutual information via kernel density estimation

The definition of mutual information naturally extends to continuous random variables. In this part, you will calculate the mutual information of gene pairs on the same data using a kernel density estimator with a Gaussian kernel. Kernel density estimation is a nonparametric method for learning a smooth probability density function (pdf) from observed data. A kernel density estimator for n training points x_1, ..., x_n is defined to be

f_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)

where the kernel K enables the smoothing effect and the bandwidth h determines how influential a point is in its neighborhood.

To abstract away the technical details of the kernel, you are required to use the SciPy function gaussian_kde for estimating the pairwise joint probability distribution of gene expression levels [4]. Continuous mutual information is defined with a double integral over both variables. To avoid the integration, you will sample the estimated pdf on a 100 × 100 grid of points uniformly distributed in the region [-0.1, 1.1] × [-0.1, 1.1] and use the sampled densities to approximate the estimated pdf [5]. Add 0.001 to the density at each sampling location in the grid to avoid precision error. The marginal probability of a single gene expression level can then be estimated via approximate integration over the other dimension (that is, a finite sum over the 100 intervals defined by the grid). Similarly, the mutual information of a gene pair can be computed via approximate integration over the region [-0.1, 1.1] × [-0.1, 1.1] (that is, a finite sum over the 100 × 100 squares defined by the grid).

You can run your program from the command line as follows:

CalcMI.py --dataset=<dataset> --str=kernel --out=<out>

[4, 5] See https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.gaussian_kde.html for examples, including np.mgrid syntax. The integration will also be discussed on Piazza.
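A minimal sketch of the grid-based approximation for 1C, assuming the two genes' expression vectors are NumPy arrays x and y. The function name kernel_mi is illustrative, and the final normalization (dividing the sampled grid so it sums to 1, rather than multiplying by the cell area) is one reasonable convention; check Piazza for the convention expected in the hand-in.

import numpy as np
from scipy.stats import gaussian_kde

def kernel_mi(x, y, grid_size=100, lo=-0.1, hi=1.1, eps=0.001):
    # Fit a 2D Gaussian KDE to the joint expression values of the gene pair.
    kde = gaussian_kde(np.vstack([x, y]))

    # Sample the estimated joint pdf on a grid_size x grid_size grid over [lo, hi] x [lo, hi].
    gx, gy = np.mgrid[lo:hi:complex(0, grid_size), lo:hi:complex(0, grid_size)]
    density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(grid_size, grid_size)
    density += eps                              # add 0.001 at every grid point to avoid precision error

    # Treat the sampled densities as a discrete joint distribution over the grid squares.
    joint = density / density.sum()
    px = joint.sum(axis=1, keepdims=True)       # approximate marginal of the first gene
    py = joint.sum(axis=0, keepdims=True)       # approximate marginal of the second gene
    return float(np.sum(joint * np.log2(joint / (px * py))))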

1D: Plot the receiver operating characteristic (ROC) curve

Complete another program, plot.py, to plot the ROC curve [6] for the output of CalcMI.py. The gold standard edges, that is, the edges in the true gene-gene network, are tabulated in a file called network.txt as two columns. Each line represents an undirected edge between two different genes. The genes in an edge are always listed with the smaller index first. Code for generating the plot and computing the area under the ROC curve (AUROC) is provided. Your task is to add code for calculating the points on the curve (a sketch appears after Part 1E below).

You can run your program from the command line as follows:

plot.py --MI=<MI> --gold=<gold_network> --name=<name>

where

<MI> is the path of the output file from CalcMI.py, formatted and sorted as in 1A.

<gold_network> is the gold standard network file for calculating TPR and FPR.

<name> is the name for your plot image. plot.py will automatically append .png, so only provide a name without the file type extension.

1E: Evaluate your predictions

After you compute the mutual information, plot the ROC curves and calculate the AUROC scores for the three following scenarios:

- bin_num = 7, str = uniform
- bin_num = 9, str = density
- str = kernel

Include the mutual information output files and plots in your handin directory and the AUROC scores in your solution PDF. Which combination gave the best AUROC? [7]

[6] The ROC curve is defined in the assigned reading Lever et al. (2016). More detailed instructions for computing it are on slide 24 of the CS 760 slides: http://pages.cs.wisc.edu/~dpage/cs760/evaluating.pdf
[7] Note that in a real research problem it would be improper to select the hyperparameters (the number of bins and binning strategy) based on the AUROC on the gold standard network. Also note that AUROC is not actually an appropriate metric for this task because the gold standard has few positive edges.
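A minimal sketch of the TPR/FPR computation for 1D, not the provided plot.py scaffolding. It assumes edges are represented as (smaller index, larger index) tuples matching network.txt, that the ranked list is in decreasing mutual-information order, and that every unordered gene pair absent from the gold standard counts as a negative; the function name roc_points is illustrative.

def roc_points(ranked_edges, gold_edges, num_genes=10):
    # ranked_edges: list of (gene_i, gene_j) tuples sorted by decreasing mutual information.
    # gold_edges: set of (gene_i, gene_j) tuples (smaller index first) read from network.txt.
    total_pos = len(gold_edges)
    total_neg = num_genes * (num_genes - 1) // 2 - total_pos   # all other unordered pairs
    tp = fp = 0
    points = [(0.0, 0.0)]                                      # (FPR, TPR), starting at the origin
    for edge in ranked_edges:
        if edge in gold_edges:
            tp += 1
        else:
            fp += 1
        points.append((fp / total_neg, tp / total_pos))
    return points

For simplicity this appends a point after every edge; the CS 760 notes describe only adding points where the score changes, which matters when several edges share the same rounded mutual information.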

Input files, example output files, and template Python files can be downloaded from https://www.biostat.wisc.edu/bmi776/hw/hw2_files.zip

Part 2: Gibbs sampling

Consider a fictitious cellular signaling cascade in the form of a binary tree with 15 nodes. Each node i represents a gene X_i with two possible values, 1 for on and -1 for off. Let R be the root node (that is, gene X_1) and S be the right-most leaf node (that is, gene X_15). The network is represented by the following probabilistic model:

P(X_1, X_2, \ldots, X_{14}, X_{15}) = \frac{1}{Z} \prod_{(i,j)} \psi(X_i, X_j)

where the (i, j) are pairs of directly connected nodes, \psi(X_i, X_j) = e^{X_i X_j} is called a potential function, and Z is the normalization factor given by

Z = \sum_{X_1, X_2, \ldots, X_{14}, X_{15}} \prod_{(i,j)} \psi(X_i, X_j)

where the sum is over all possible configurations. This is an instance of a Markov random field (a.k.a. undirected graphical model).

2A: Probabilities in the Markov random field

Answer the following questions in your solution PDF:

- How many possible configurations are there in this network?
- What are the most probable and the least probable configurations?
- Is it true that P(R = 1 | S = 1) = P(S = 1 | R = 1)? Show your work.
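A short sketch of brute-force enumeration over the model above, useful for reasoning about 2A and for the optional sanity check mentioned in 2B. It assumes the 15 nodes are numbered level by level (heap order), so node i is connected to nodes 2i and 2i+1; the function names are illustrative.

import itertools
import math

# Edges of the 15-node binary tree, assuming heap-style numbering:
# node i (1..7) is connected to its children 2i and 2i+1.
EDGES = [(i, 2 * i) for i in range(1, 8)] + [(i, 2 * i + 1) for i in range(1, 8)]

def unnormalized_prob(x):
    # x maps node index (1..15) to +1 or -1; psi(X_i, X_j) = exp(X_i * X_j).
    return math.exp(sum(x[i] * x[j] for i, j in EDGES))

def exact_conditional():
    # Enumerate all 2^15 configurations to compute P(R = 1 | S = 1) exactly.
    num = den = 0.0
    for vals in itertools.product([-1, 1], repeat=15):
        x = dict(enumerate(vals, start=1))       # node index -> value
        if x[15] != 1:                           # condition on S = X_15 = 1
            continue
        w = unnormalized_prob(x)
        den += w
        if x[1] == 1:                            # event R = X_1 = 1
            num += w
    return num / den

print(exact_conditional())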

2B: Gibbs sampling in the Markov random field

Next, write a Gibbs sampler, gibbs.py, to estimate P(R = 1 | S = 1). This Python script does not take any input file and outputs the estimated probability P(R = 1 | S = 1) to the screen. Start sampling from the configuration where all X_i except S are set to -1 and S is set to 1. Iteratively update nodes with one of the following strategies: (a) traverse the tree by stepping through the levels in top-down order, and scan through the nodes within each level from left to right, or (b) randomly pick a node at each iteration.

In sampling-based inference, we typically discard the initial samples during a burn-in period. Discard the first 10^4 samples your Gibbs sampler generates. Estimate P(R = 1 | S = 1) using the next 10^5 samples. Run your sampler three times and report your three estimated P(R = 1 | S = 1) values in your solution PDF.

If you want to check whether your Gibbs sampler is correct, you can also write code that computes the exact probability P(R = 1 | S = 1) by enumerating all possible configurations. No code needs to be submitted for this optional sanity check.
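A minimal sketch of one possible gibbs.py using update strategy (b) and the same heap-style node numbering assumed above. It is a sketch, not the required implementation: it treats each single-node update as one sample (check on Piazza whether a "sample" means one node update or one full sweep), and it keeps S = X_15 clamped to 1 since the conditional distribution is being sampled.

import math
import random

# Neighbors of each node in the 15-node binary tree, assuming heap-style numbering.
NEIGHBORS = {i: [] for i in range(1, 16)}
for i in range(1, 8):
    for child in (2 * i, 2 * i + 1):
        NEIGHBORS[i].append(child)
        NEIGHBORS[child].append(i)

BURN_IN, NUM_SAMPLES = 10**4, 10**5

def gibbs_estimate():
    # Initial configuration: every node -1 except S = X_15, which is clamped to 1.
    x = {i: -1 for i in range(1, 16)}
    x[15] = 1
    hits = 0
    for t in range(BURN_IN + NUM_SAMPLES):
        i = random.randint(1, 14)            # strategy (b): random node; never resample S = X_15
        s = sum(x[j] for j in NEIGHBORS[i])
        p_on = math.exp(s) / (math.exp(s) + math.exp(-s))   # P(X_i = 1 | neighbors)
        x[i] = 1 if random.random() < p_on else -1
        if t >= BURN_IN:                     # count samples after the burn-in period
            hits += (x[1] == 1)
    return hits / NUM_SAMPLES

print(gibbs_estimate())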

Part 3: Calculating q-values

You will manually calculate q-values from a p-value distribution.

3A: Estimating λ

First, use the histogram of the p-value distribution below to estimate λ visually, as in Storey and Tibshirani 2003. The distribution contains p-values for 20000 features. Estimate λ to the nearest 0.1 and report the value you estimated.

3B: Estimating π̂0(λ)

Use the table below to estimate π̂0(λ) for the value of λ that you selected. Report the π̂0 you estimated rounded to two decimal places.

λ	#{p_i > λ; i = 1, ..., m}
0.0	20000
0.1	15427
0.2	12893
0.3	11382
0.4	9834
0.5	8466
0.6	7030
0.7	5259
0.8	3484
0.9	1714
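For reference, Storey and Tibshirani (2003) estimate π̂0(λ) = #{p_i > λ} / (m(1 − λ)). A one-line check of the arithmetic, assuming for illustration that λ = 0.5 was chosen; substitute the λ you estimated in 3A.

m = 20000                     # total number of features tested
counts = {0.0: 20000, 0.1: 15427, 0.2: 12893, 0.3: 11382, 0.4: 9834,
          0.5: 8466, 0.6: 7030, 0.7: 5259, 0.8: 3484, 0.9: 1714}

lam = 0.5                     # hypothetical choice; use your value from 3A
pi0_hat = counts[lam] / (m * (1 - lam))
print(round(pi0_hat, 2))      # e.g. 8466 / (20000 * 0.5) = 0.85 for lambda = 0.5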

3C: Calculating q-values

Although 20000 features have been tested, only the top 10 features ranked by p-value are listed below:

Rank	p-value
1	0.000003
2	0.000007
3	0.000013
4	0.000024
5	0.000028
6	0.000033
7	0.000046
8	0.000055
9	0.000096
10	0.000099

Calculate the q-value for these 10 features, rounded to three decimal places. You may assume that the q-values for the remaining features not shown in the list above do not affect the q-values of these 10 features.
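A minimal sketch of the q-value calculation following Storey and Tibshirani (2003), where q(p_(i)) = min over j ≥ i of π̂0 · m · p_(j) / j. The p-values are the ten listed above; pi0_hat is a placeholder for the estimate you obtained in 3B, and restricting the minimum to the ten listed features matches the assumption stated above.

pi0_hat = 0.85                # placeholder; substitute your estimate from 3B
m = 20000
pvals = [0.000003, 0.000007, 0.000013, 0.000024, 0.000028,
         0.000033, 0.000046, 0.000055, 0.000096, 0.000099]

# q(p_(i)) = min over j >= i of pi0_hat * m * p_(j) / j, computed from the bottom up.
qvals = [0.0] * len(pvals)
running_min = float("inf")
for i in range(len(pvals) - 1, -1, -1):
    running_min = min(running_min, pi0_hat * m * pvals[i] / (i + 1))
    qvals[i] = running_min

for rank, q in enumerate(qvals, start=1):
    print(rank, round(q, 3))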