Detecting Novel Associations in Large Data Sets

1 Detecting Novel Associations in Large Data Sets. J. Hjelmborg, Department of Biostatistics (Biostat). 5 February 2013.

2 Overview
1 Introduction
2 Heuristics
3 Examples
4 Estimation
5 Project proposal
6 Conclusion

3 Review of paper(s)
- Detecting Novel Associations in Large Data Sets. David N. Reshef et al. Science 334, 1518 (2011).
- A Correlation for the 21st Century. Terry Speed. Science 334, 1502 (2011).

4 Summary
The maximal information coefficient (MIC) is a measure of two-variable dependence designed specifically for rapid exploration of many-dimensional data sets. MIC is part of a larger family of maximal information-based nonparametric exploration (MINE) statistics, which can be used not only to identify important relationships in data sets but also to characterize them.

5 Measuring dependence
Given a many-dimensional data set, search for any association between pairs of variables X and Y, and rank the pairs by the strength of that association.
- Generality: the statistic should capture any interesting association, not only particular function types and not even only functional dependencies.
- Equitability: relationships of different types with the same amount of noise should receive similar scores. In particular, functional dependencies with similar $R^2$ values should receive similar scores.

6 Classic Measure of Uncertainty
Given a discrete random variable X on states {1, ..., M} with probabilities {p_1, ..., p_M},
$$H(p_1, \dots, p_M) = -\sum_{k=1}^{M} p_k \log(p_k)$$
measures the uncertainty of X.
- H is the entropy, the only function satisfying the axioms of uncertainty; see Amber (1986) or a graduate-level textbook in physics or computer science.
- With base-2 logarithms, H is the minimum average number of "yes or no" questions required to determine the result of one observation of X.
Measure of the information conveyed about X by Y:
$$I(X; Y) = H(X) - H(X \mid Y)$$
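The two quantities on this slide can be computed directly for small discrete distributions. The following is a minimal sketch in Python, assuming base-2 logarithms and a joint probability table given as a nested list; the function names `entropy` and `mutual_information` are illustrative and do not come from the talk.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy -sum_k p_k log(p_k); zero-probability states contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def mutual_information(joint, base=2.0):
    """I(X;Y) = H(X) - H(X|Y), where joint[i][j] = P(X = i, Y = j)."""
    px = [sum(row) for row in joint]            # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]      # marginal distribution of Y
    h_joint = entropy([p for row in joint for p in row], base)
    h_x_given_y = h_joint - entropy(py, base)   # chain rule: H(X|Y) = H(X,Y) - H(Y)
    return entropy(px, base) - h_x_given_y

# Hypothetical example tables: Y is a copy of X, and Y is independent of X.
copy_joint = [[0.5, 0.0], [0.0, 0.5]]
indep_joint = [[0.25, 0.25], [0.25, 0.25]]

print(entropy([0.5, 0.5]))              # uncertainty of a fair coin, in bits
print(mutual_information(copy_joint))   # Y determines X completely
print(mutual_information(indep_joint))  # Y says nothing about X
```

In the copy case all of the uncertainty in X is removed by observing Y, so the mutual information equals H(X); in the independent case nothing is removed, so it is zero.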

7 Definition
Let $D \subset \mathbb{R}^2$ be a finite set of ordered pairs. Partitioning the elements of D into bins induces an x-by-y grid $G_{xy}$ covering D. For a grid G, let $D|_G$ be the distribution induced by the points in D on the cells of G.
Definition of Maximal Information Coefficient
Define $I^*(D, x, y) = \max_G I(D|_G)$, where the maximum is over all grids G with x columns and y rows.
Define the characteristic matrix M(D), with entries
$$M(D)_{x,y} = \frac{I^*(D, x, y)}{\log \min\{x, y\}}.$$
Define $\mathrm{MIC}(D) = \max_{xy < B(n)} \{ M(D)_{x,y} \}$, where n is the number of points in D.
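The definition can be made concrete with a small Python sketch. It is not the estimator from the paper: instead of maximizing over all grids with x columns and y rows, it only tries equal-frequency (quantile) grids, so each entry it computes is a lower bound on the corresponding characteristic-matrix entry. The function names `grid_mutual_information` and `mic_equipartition`, the use of NumPy, and the grid-size bound B(n) = n^alpha with a default `alpha=0.6` are assumptions made for illustration.

```python
import numpy as np

def grid_mutual_information(x, y, nx, ny):
    """Mutual information (bits) of the distribution that the points induce
    on an nx-by-ny grid with roughly equal-frequency columns and rows."""
    # Interior quantiles serve as column and row boundaries.
    x_edges = np.quantile(x, np.linspace(0, 1, nx + 1)[1:-1])
    y_edges = np.quantile(y, np.linspace(0, 1, ny + 1)[1:-1])
    xi = np.searchsorted(x_edges, x, side="right")  # column index, 0..nx-1
    yi = np.searchsorted(y_edges, y, side="right")  # row index, 0..ny-1
    counts = np.zeros((nx, ny))
    np.add.at(counts, (xi, yi), 1)
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz])))

def mic_equipartition(x, y, alpha=0.6):
    """Largest normalized score over quantile grids with nx * ny < n**alpha.
    Simplification: the true MIC maximizes over all grids, not only these."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    bound = len(x) ** alpha
    best = 0.0
    nx = 2
    while nx * 2 < bound:
        ny = 2
        while nx * ny < bound:
            score = grid_mutual_information(x, y, nx, ny) / np.log2(min(nx, ny))
            best = max(best, score)
            ny += 1
        nx += 1
    return float(best)

# Hypothetical usage on simulated data.
rng = np.random.default_rng(0)
x = rng.uniform(size=500)
print(mic_equipartition(x, x))                     # noiseless linear relationship
print(mic_equipartition(x, (x - 0.5) ** 2))        # noiseless parabola
print(mic_equipartition(x, rng.uniform(size=500))) # independent variables
```

Dividing by log min{x, y} keeps every entry between 0 and 1, because the mutual information of a grid cannot exceed the entropy of its coarser axis.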

9 (A) For each pair (x,y) find the x-by-y grid with the highest induced mutual information. (B) Characteristic Matrix that stores, for each resolution, the best grid at that resolution and its normalized score. (C) MIC corresponds to the highest point on this surface of normalized scores.

10 Properties
Paper: the following statements are formalized and proved:
- The MIC of data sampled from a distribution (X, Y), where X and Y are continuous random variables, converges to 0 as sample size grows if and only if X and Y are statistically independent. (Theorem 1)
- The MIC of a noiseless functional relationship converges to 1 as sample size grows, provided the function governing the relationship is nowhere-constant. (Theorem 3)
- More generally, the MIC of data sampled from a finite union of images of nowhere-flat, nowhere-vertical differentiable curves will approach 1 as sample size grows. (Theorem 4)
- For any nowhere-constant function, a set of points drawn from the curve defined by the function and then vertically perturbed will receive an MIC that is lower bounded in terms of the amount of perturbation, given a large enough sample size. Moreover, this lower bound can be stated in terms of $R^2$. (Theorem 5)

12 Example 1: Characteristic matrices

18 Estimation of MIC
- The space of grids that must be searched to compute each entry of the characteristic matrix grows exponentially with the number of data points.
- For efficiency, a heuristic dynamic programming algorithm is used to approximate MIC in practice.
- In the paper: $B(n) = n^{0.6}$, which is found to work well in practice.
- In the paper: the false discovery rate (FDR) is controlled for all analyses using the Benjamini and Hochberg procedure.
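The Benjamini and Hochberg step-up procedure mentioned on this slide can be sketched in a few lines of Python. The sketch assumes that a p-value has already been obtained for each variable pair (how the paper derives them is not covered here); the function name `benjamini_hochberg`, the level `q=0.05`, and the example p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Boolean array marking the hypotheses rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p-values, ascending
    thresholds = q * np.arange(1, m + 1) / m    # q * rank / m for rank = 1..m
    passed = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])       # largest rank meeting its threshold
        rejected[order[: k + 1]] = True         # reject that rank and all smaller ones
    return rejected

# Hypothetical p-values for ten variable pairs.
p_example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_example))
```

The procedure is step-up: once the largest rank that meets its threshold is found, every smaller p-value is rejected as well, even if it missed its own threshold.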

20 Project proposal
- epigenetic twin data
- GWAS (gwas18)

22 Conclusion
- Identifying interesting relationships between pairs of variables in large data sets!
- MIC captures a wide range of associations, both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination ($R^2$) of the data relative to the regression function.
- Paper: MIC and MINE are applied to data sets in global health, gene expression, major-league baseball, and the human gut microbiota, identifying known and novel relationships.


More information

Features: representation, normalization, selection. Chapter e-9

Features: representation, normalization, selection. Chapter e-9 Features: representation, normalization, selection Chapter e-9 1 Features Distinguish between instances (e.g. an image that you need to classify), and the features you create for an instance. Features

More information

Synthetic Geometry. 1.1 Foundations 1.2 The axioms of projective geometry

Synthetic Geometry. 1.1 Foundations 1.2 The axioms of projective geometry Synthetic Geometry 1.1 Foundations 1.2 The axioms of projective geometry Foundations Def: A geometry is a pair G = (Ω, I), where Ω is a set and I a relation on Ω that is symmetric and reflexive, i.e. 1.

More information

Semi supervised clustering for Text Clustering

Semi supervised clustering for Text Clustering Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering

More information

15.082J and 6.855J. Lagrangian Relaxation 2 Algorithms Application to LPs

15.082J and 6.855J. Lagrangian Relaxation 2 Algorithms Application to LPs 15.082J and 6.855J Lagrangian Relaxation 2 Algorithms Application to LPs 1 The Constrained Shortest Path Problem (1,10) 2 (1,1) 4 (2,3) (1,7) 1 (10,3) (1,2) (10,1) (5,7) 3 (12,3) 5 (2,2) 6 Find the shortest

More information

Distance-based Methods: Drawbacks

Distance-based Methods: Drawbacks Distance-based Methods: Drawbacks Hard to find clusters with irregular shapes Hard to specify the number of clusters Heuristic: a cluster must be dense Jian Pei: CMPT 459/741 Clustering (3) 1 How to Find

More information

Persistent Homology and Nested Dissection

Persistent Homology and Nested Dissection Persistent Homology and Nested Dissection Don Sheehy University of Connecticut joint work with Michael Kerber and Primoz Skraba A Topological Data Analysis Pipeline A Topological Data Analysis Pipeline

More information

Analyzing Large Biological Datasets in Bioinformatics with Maximal Information Coefficient

Analyzing Large Biological Datasets in Bioinformatics with Maximal Information Coefficient IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, MANUSCRIPT ID 1 Analyzing Large Biological Datasets in Bioinformatics with Maximal Information Coefficient Chao Wang, Member IEEE, Xi Li,

More information

arxiv: v2 [cs.lg] 14 Aug 2013

arxiv: v2 [cs.lg] 14 Aug 2013 Equitability Analysis of the Maximal Information Coefficient, with Comparisons arxiv:3.634v2 [cs.lg] 4 Aug 23 David N. Reshef Department of Electrical Engineering and Computer Science Harvard-MIT Division

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

A Dendrogram. Bioinformatics (Lec 17)

A Dendrogram. Bioinformatics (Lec 17) A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and

More information

DISCRETE MATHEMATICS

DISCRETE MATHEMATICS DISCRETE MATHEMATICS WITH APPLICATIONS THIRD EDITION SUSANNA S. EPP DePaul University THOIVISON * BROOKS/COLE Australia Canada Mexico Singapore Spain United Kingdom United States CONTENTS Chapter 1 The

More information

Package nettools. August 29, 2016

Package nettools. August 29, 2016 Package nettools August 29, 2016 Type Package Title A Network Comparison Framework Version 1.0.1 Date 2014-09-02 Maintainer Michele Filosi Depends R (>= 2.14.1), methods Imports parallel,

More information

Clustering. Unsupervised Learning

Clustering. Unsupervised Learning Clustering. Unsupervised Learning Maria-Florina Balcan 04/06/2015 Reading: Chapter 14.3: Hastie, Tibshirani, Friedman. Additional resources: Center Based Clustering: A Foundational Perspective. Awasthi,

More information

Approximate Level-Crossing Probabilities for Interactive Visualization of Uncertain Isocontours

Approximate Level-Crossing Probabilities for Interactive Visualization of Uncertain Isocontours Approximate Level-Crossing Probabilities Kai Pöthkow, Christoph Petz & Hans-Christian Hege Working with Uncertainty Workshop, VisWeek 2011 Zuse Institute Berlin Previous Work Pfaffelmoser, Reitinger &

More information

Scott Smith Advanced Image Processing March 15, Speeded-Up Robust Features SURF

Scott Smith Advanced Image Processing March 15, Speeded-Up Robust Features SURF Scott Smith Advanced Image Processing March 15, 2011 Speeded-Up Robust Features SURF Overview Why SURF? How SURF works Feature detection Scale Space Rotational invariance Feature vectors SURF vs Sift Assumptions

More information

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch.

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch. Iterative Improvement Algorithm design technique for solving optimization problems Start with a feasible solution Repeat the following step until no improvement can be found: change the current feasible

More information

Practical OmicsFusion

Practical OmicsFusion Practical OmicsFusion Introduction In this practical, we will analyse data, from an experiment which aim was to identify the most important metabolites that are related to potato flesh colour, from an

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup For our analysis goals we would like to do: Y X N (X, 2 I) and then interpret the coefficients

More information