Detecting Novel Associations in Large Data Sets

1 Detecting Novel Associations in Large Data Sets. J. Hjelmborg, Department of Biostatistics (Biostat). 5 February 2013.

2 Overview: 1 Introduction, 2 Heuristics, 3 Examples, 4 Estimation, 5 Project proposal, 6 Conclusion

3 Review of paper(s)
David N. Reshef et al., Detecting Novel Associations in Large Data Sets, Science 334, 1518 (2011).
Terry Speed, A Correlation for the 21st Century, Science 334, 1502 (2011).

4 Summary The maximal information coefficient (MIC) is a measure of two-variable dependence designed specifically for rapid exploration of many-dimensional data sets. MIC is part of a larger family of maximal information-based nonparametric exploration (MINE) statistics, which can be used not only to identify important relationships in data sets but also to characterize them.

5 Measuring dependence Given a many-dimensional data set, search for any association between pairs of variables X and Y and rank the pairs by their scores. Generality: any interesting association should be captured by the statistic, not only functional dependencies. Equitability: relationships of different types with the same amount of noise should have similar scores. In particular, functional dependencies with similar $R^2$ values should have similar scores.

6 Classic Measure of Uncertainty
Given a discrete random variable X on states $\{1, \dots, M\}$ with probabilities $\{p_1, \dots, p_M\}$,
$$H(p_1, \dots, p_M) = -\sum_{k=1}^{M} p_k \log(p_k)$$
measures the uncertainty of X.
- It is the entropy, the only function satisfying the axioms of uncertainty; see Amber (1986) or a graduate-level textbook in physics or computer science.
- It is the minimum average number of "yes or no" questions required to determine the result of one observation of X.
Measure of information conveyed about X by Y:
$$I(X \mid Y) = H(X) - H(X \mid Y)$$
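As a small illustration of the two quantities on this slide, the following sketch computes the entropy of a discrete distribution and the information conveyed about X by Y from a joint probability table. The function names `entropy` and `mutual_information` are illustrative choices, not taken from the paper, and base-2 logarithms are assumed so that results are in bits.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; states with probability 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """H(X) - H(X|Y) for a joint table with joint[i][j] = P(X=i, Y=j)."""
    p_x = [sum(row) for row in joint]          # marginal of X (row sums)
    p_y = [sum(col) for col in zip(*joint)]    # marginal of Y (column sums)
    h_joint = entropy([p for row in joint for p in row])
    h_x_given_y = h_joint - entropy(p_y)       # chain rule: H(X|Y) = H(X,Y) - H(Y)
    return entropy(p_x) - h_x_given_y


# A fair coin carries one bit of uncertainty.
fair_coin_bits = entropy([0.5, 0.5])

# Independent variables: Y conveys nothing about X.
independent_bits = mutual_information([[0.25, 0.25], [0.25, 0.25]])

# Y is a copy of X: Y conveys everything about X.
identical_bits = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```

The two extreme tables bracket the range of the measure: it is zero under independence and equals H(X) when Y determines X.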

7 Definition
Let $D \subset \mathbb{R}^2$ be a finite set of ordered pairs. Partitioning the elements of D into bins induces an x-by-y grid $G_{xy}$ covering D. For a grid G, let $D|_G$ be the distribution induced by the points in D on the cells of G.
Definition of Maximal Information Coefficient
Define $I^*(D, x, y) = \max_G I(D|_G)$, where the maximum is over all grids G with x columns and y rows.
Define the characteristic matrix $M(D)$, with entries
$$M(D)_{x,y} = \frac{I^*(D, x, y)}{\log \min\{x, y\}}.$$
Define $\mathrm{MIC}(D) = \max_{xy < B(n)} \{M(D)_{x,y}\}$, where n is the number of points in D.
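The definition can be made concrete with a deliberately simplified sketch. Instead of maximizing over all x-by-y grids, it evaluates only one grid per resolution, the one with equal-frequency bins on both axes, so each entry is a lower bound on the true characteristic-matrix entry and the result is a lower bound on MIC. This is not the algorithm from the paper; the names `equal_frequency_bins`, `grid_mutual_information` and `mic_lower_bound` are hypothetical, the default exponent 0.6 mirrors the B(n) choice reported later in the talk, and distinct data values are assumed (ties are split arbitrarily by rank).

```python
import math
from collections import Counter


def equal_frequency_bins(values, k):
    """Assign each value a bin index 0..k-1 so bins hold nearly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def grid_mutual_information(col_bins, row_bins):
    """Mutual information (bits) of the distribution induced on the grid cells."""
    n = len(col_bins)
    joint = Counter(zip(col_bins, row_bins))
    cols = Counter(col_bins)
    rows = Counter(row_bins)
    return sum(
        (count / n) * math.log2(count * n / (cols[a] * rows[b]))
        for (a, b), count in joint.items()
    )


def mic_lower_bound(xs, ys, exponent=0.6):
    """Max normalized score over equal-frequency grids with x * y < n ** exponent."""
    budget = len(xs) ** exponent
    best = 0.0
    x = 2
    while x * 2 < budget:            # at least a 2-row grid must fit the budget
        y = 2
        while x * y < budget:
            mi = grid_mutual_information(
                equal_frequency_bins(xs, x), equal_frequency_bins(ys, y)
            )
            best = max(best, mi / math.log2(min(x, y)))
            y += 1
        x += 1
    return best


# Noiseless linear relationship: the 2-by-2 grid already attains score 1.
xs = list(range(200))
linear_score = mic_lower_bound(xs, [2 * v + 1 for v in xs])
```

The normalization by log min{x, y} keeps every entry at most 1, because the mutual information on a grid cannot exceed the entropy of its coarser axis.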

9 (A) For each pair (x,y) find the x-by-y grid with the highest induced mutual information. (B) Characteristic Matrix that stores, for each resolution, the best grid at that resolution and its normalized score. (C) MIC corresponds to the highest point on this surface of normalized scores.

10 Properties
Paper: The following statements are formalized and proved:
The MIC of data sampled from a distribution (X, Y), where X and Y are continuous random variables, converges to 0 as sample size grows if and only if X and Y are statistically independent. (Theorem 1)
The MIC of a noiseless functional relationship converges to 1 as sample size grows, provided the function governing the relationship is nowhere-constant. (Theorem 3)
More generally, the MIC of data sampled from a finite union of images of nowhere-flat, nowhere-vertical differentiable curves will approach 1 as sample size grows. (Theorem 4)
For any nowhere-constant function, a set of points drawn from the curve defined by the function and then vertically perturbed will receive an MIC that is lower bounded in terms of the amount of perturbation, given a large enough sample size. Moreover, this lower bound can be stated in terms of $R^2$. (Theorem 5)

12 Example 1: Characteristic matrices

18 Estimation of MIC The space of grids that must be searched to compute each entry of the characteristic matrix grows exponentially with the number of data points. For efficiency, a heuristic dynamic programming algorithm is used to approximate MIC in practice. In paper: $B(n) = n^{0.6}$, which is found to work well in practice. In paper: The false discovery rate (FDR) is controlled for all analyses using the Benjamini and Hochberg procedure.
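For readers unfamiliar with the Benjamini and Hochberg procedure mentioned on this slide, here is a minimal sketch of the standard step-up rule: sort the m p-values, find the largest rank k with p_(k) <= (k / m) * q, and reject the k hypotheses with the smallest p-values. The function name `benjamini_hochberg`, the level q = 0.05 and the example p-values are illustrative assumptions; the paper's actual settings are not given in these slides.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest 1-based rank k whose sorted p-value is at most (k / m) * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank

    # Step-up rule: reject everything up to the cutoff rank, even if some
    # smaller-ranked p-values missed their own thresholds.
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


# Hypothetical p-values for four variable pairs (made up for illustration).
example = benjamini_hochberg([0.02, 0.03, 0.035, 0.9], q=0.05)
```

In this example the first two p-values exceed their own thresholds (0.0125 and 0.025), but the third meets its threshold (0.0375), so the three smallest are all rejected; this is what distinguishes the step-up rule from testing each rank in isolation.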

20 Project proposal epigenetic twin data gwas (gwas18)

22 Conclusion MIC identifies interesting relationships between pairs of variables in large data sets. It captures a wide range of associations, both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination ($R^2$) of the data relative to the regression function. Paper: MIC and MINE are applied to data sets in global health, gene expression, major-league baseball, and the human gut microbiota, identifying known and novel relationships.