Visually-Motivated Characterizations of Point Sets Embedded in High-Dimensional Geometric Spaces: Current Projects

Size: px
Start display at page:

Download "Visually-Motivated Characterizations of Point Sets Embedded in High-Dimensional Geometric Spaces: Current Projects"

Transcription

1 Visually-Motivated Characterizations of Point Sets Embedded in High-Dimensional Geometric Spaces: Current Projects Leland Wilkinson SYSTAT University of Illinois at Chicago (Computer Science) Northwestern University (Statistics) Robert Grossman University of Illinois at Chicago (Computer Science) Adilson Motter Northwestern University (Physics) 1

2 Subprojects Scagnostics Classification Venn/Euler Diagrams Treemaps 2

3 Scagnostics Wilkinson, Anand, and Grossman (2006) characterize a scatterplot (2D point set) with nine measures. We base our measures on three geometric graphs. Convex Hull Alpha Shape Minimum Spanning Tree 3

4 Each geometric graph is a subset of the Delaunay Triangulation 4

5 Convex Hull 5

6 Alpha Shape 6

7 Minimum Spanning Tree 7

8 Shape Convex: area of alpha shape divided by area of convex hull Skinny: ratio of perimeter to area of the alpha shape Stringy: ratio of 2-degree vertices in MST to number of vertices > 1-degree 8

9 Trend Monotonic: squared Spearman correlation coefficient 9

10 Density Skewed: ratio of (Q90 - Q50) / (Q90 - Q10), where quantiles are on MST edge lengths Clumpy: 1 minus the ratio of the longest edge in the largest runt (blue) to the length of runt-cutting edge (red) longest edge in runt largest runt Outlying: proportion of total MST length due to edges adjacent to outliers 10

11 Density Sparse: 90th percentile of distribution of edge lengths in MST Striated: proportion of all vertices in the MST that are degree-2 and have a cosine between adjacent edges less than

12 Software (Wilkinson and Anand) 12

13 Fishing Expeditions in Visual Analytics We have used the empirical distribution of Scagnostics (Wilkinson and Wills, JCGS, 2008), the False Discovery Rate (FDR) statistic, and automated statistical modeling to develop algorithms for reducing false discoveries when people use visual-analytic software (Wilkinson and Wills, IVS, 2009). 40 Mean Elevation Temp(F) Jan Feb Mar Apr May Jun Jul Aug Sep Oct Day Temperature Day These data are fake. They are a random walk produced by a random number generator. These data are real. They are temperature measurements of a cow over 80-days. The data are periodic. 13

14 Distribution of Scagnostics Monotonic Stringy Skinny Convex Striated Sparse Clumpy Skewed Outlying Measure 14

15 False Discovery Rate (FDR) Benjamini and Hochberg, JRSS, 1995 Sort a list of p values and define p0 = 0 P = [p 1,..., p m ] Compute BH index Use adjusted p value (indexed by rbh) as cutoff Based on uniform distribution of p values, so tests may be heterogeneous 15

16 Scagnostics on Graphs Adilson Motter is developing scagnostics for (V, E) graphs. This effort is similar to what we did for scatterplots. One wants a relatively small set of measures that are relatively independent and that nicely characterize a wide universe of real directed and undirected graphs. The goal is to classify real graphs, recognize anomalies, and locate exemplars in subgraphs of large graphs. 16

17 Classification Based on our analysis of how people visually classify 2D point sets, we have developed a high-dimensional, linear complexity, supervised classifier called Linf. This classifier, using the L-infinity metric, rivals or outperforms leading classifiers on 10 difficult datasets. 17

18 Visual Classification Anand, Wilkinson, and Tuan (ICDM, 2009) developed a visual classifier to analyze how people distinguish target point sets from background point sets. We designed the software to behave like a video game. 18

19 Description Regions Weighted L-infinity norm x = sup(w 1 x 1, w 2 x 2,... w n x n ) Hypercube Description Region (HDR) The set of points less than a fixed distance from a single point using the L-infinity norm. Composite Hypercube Description Region (CHDR) Union of HDRs. 19

20 The Linf Algorithm 1.Transform t(x) = sgn(x)sqrt(abs(x)) 2.Project 3.Bin Pick next target class (one against all). Compute 25 random, integer-weighted, 2D projections. Pick best 5 projections on a class-separation measure. b = 2log 2 (n) Compute purity of target class instances in each bin. 4.Cover j j j j i 5.Repeat Store best cover in scoring list. Repeat steps 2-4 until no training instances left. 20

21 Results waveform S R L B T B R S L T vowel R T SB L B T R S L Dataset spect shuttle satellite orange10 optdigits L R TL SB T R L B S R LB T T SRL B R S B T S R T S B B B B T S L B S T R L R S T R T R S L L L Naive Bayes Support Vector Machine Classification Tree Random Forest Linf madelon T L B S R B R T L S Standardized Error cancer L S T R B BR T S L adult SR T LB B T R L S B : Naive Bayes L : Linf R : Random Forest S : Support Vector Machine T : Classification Tree Error (%) Time (sec) 21

22 Venn/Euler Diagrams How do we fit a Venn/Euler diagram to empirical data involving more than 3 sets? These diagrams are widely used in the bioinformatics community. We have developed an R function called venneuler() and have posted it on CRAN for general use. Euler diagram for 11 animals based on gene lists from the Agilent DNA oligo microarray database. The analysis was based on 247,412 gene symbols and 11 animal names. The stress for this solution is.06, with corresponding correlation of 0.97 between circle and intersection areas and counts of genes. The computation was completed in under 30 seconds on a 2.5 GHz MacBook Pro running the Java 1.5 Virtual Machine in 2GB of allocated memory. Mouse Rat Frog Pig Sheep Rabbit Horse Human Chicken Cow Dog 22

23 The venneuler() Algorithm 1.Make list of m intersections among n sets (m = 2 n ) 2.Make list of disk intersections (size disks proportional to Xi ). 3.Make list of disjoint counts and list of disjoint areas. 4.Estimate model. 5.Move disks and repeat 2-5 to minimize error, using gradient: 23

24 Calculating Areas

25 Treemaps We are investigating the ordering of treemaps. 25

26 Seriating a Tree Canada Italy France Sweden WGermany Belgium Denmark UK Austria Switzerland Czechoslov Poland Cuba Chile Argentina Uruguay Panama Colombia Peru Brazil Ecuador ElSalvador Guatemala Iraq Libya Ethiopia Haiti Somalia Afghanistan Guinea Permuted Data Matrix URBAN LITERACY LIFEEXPF LIFEEXPM GDP_CAP HEALTH EDUC MIL DEATH_RT BABYMORT BIRTH_RT B_TO_D Wilkinson and Friendly, TAS,

27 Fisher-Anderson Iris Data Petal Length + Petal Width Sepal Length + Sepal Width Setosa Versicolor Virginica Species X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X41 X42 X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60 X61 X62 X63 X64 X65 X66 X67 X68 X69 X70 X71 X72 X73 X74 X75 X76 X77 X78 X79 X80 X81 X82 X83 X84 X85 X86 X87 X88 X89 X90 X91 X92 X93 X94 X95 X96 X97 X98 X99 X100 X101 X103 X104 X105 X106 X107 X109 X110 X111 X112 X113 X114 X115 X116 X118 X119 X120 X122 X123 X124 X125 X126 X127 X128 X130 X132 X134 X135 X136 X137 X139 X140 X141 X142 X145 X146 X147 X148 X149 X150 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X41 X42 X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60 X61 X62 X63 X64 X65 X66 X67 X68 X69 X70 X71 X72 X73 X74 X75 X76 X77 X78 X79 X80 X81 X82 X83 X84 X85 X86 X87 X88 X89 X90 X91 X92 X93 X94 X95 X96 X97 X98 X99 X100 X100 X101 X103 X104 X105 X106 X107 X109 X110 X111 X112 X113 X114 X115 X116 X118 X119 X120 X122 X123 X124 X125 X126 X127 X128 X130 X132 X134 X135 X136 X137 X139 X140 X141 X142 X145 X146 X147 X148 X149 X150

FODAVA Partners Leland Wilkinson (SYSTAT & UIC) Robert Grossman (UIC) Adilson Motter (Northwestern) Anushka Anand, Troy Hernandez (UIC)

FODAVA Partners Leland Wilkinson (SYSTAT & UIC) Robert Grossman (UIC) Adilson Motter (Northwestern) Anushka Anand, Troy Hernandez (UIC) FODAVA Partners Leland Wilkinson (SYSTAT & UIC) Robert Grossman (UIC) Adilson Motter (Northwestern) Anushka Anand, Troy Hernandez (UIC) Visually-Motivated Characterizations of Point Sets Embedded in High-Dimensional

More information

FODAVA Partners Leland Wilkinson (SYSTAT, Advise Analytics & UIC) Anushka Anand, Tuan Nhon Dang (UIC)

FODAVA Partners Leland Wilkinson (SYSTAT, Advise Analytics & UIC) Anushka Anand, Tuan Nhon Dang (UIC) FODAVA Partners Leland Wilkinson (SYSTAT, Advise Analytics & UIC) Anushka Anand, Tuan Nhon Dang (UIC) Anomaly Discovery through Visual Characterizations of Point Sets Embedded in High- Dimensional Geometric

More information

Evgeny Maksakov Advantages and disadvantages: Advantages and disadvantages: Advantages and disadvantages: Advantages and disadvantages:

Evgeny Maksakov Advantages and disadvantages: Advantages and disadvantages: Advantages and disadvantages: Advantages and disadvantages: Today Problems with visualizing high dimensional data Problem Overview Direct Visualization Approaches High dimensionality Visual cluttering Clarity of representation Visualization is time consuming Dimensional

More information

Timeseer: Detecting interesting distributions in multiple time series data

Timeseer: Detecting interesting distributions in multiple time series data Timeseer: Detecting interesting distributions in multiple time series data Tuan Nhon Dang University of Illinois at Chicago tdang@cs.uic.edu Leland Wilkinson University of Illinois at Chicago leland.wilkinson@systat.com

More information

Lecture 6: Statistical Graphics

Lecture 6: Statistical Graphics Lecture 6: Statistical Graphics Information Visualization CPSC 533C, Fall 2009 Tamara Munzner UBC Computer Science Mon, 28 September 2009 1 / 34 Readings Covered Multi-Scale Banking to 45 Degrees. Jeffrey

More information

Visualizing high-dimensional data:

Visualizing high-dimensional data: Visualizing high-dimensional data: Applying graph theory to data visualization Wayne Oldford based on joint work with Catherine Hurley (Maynooth, Ireland) Adrian Waddell (Waterloo, Canada) Challenge p

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Data Mining: Exploring Data. Lecture Notes for Chapter 3 Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Look for accompanying R code on the course web site. Topics Exploratory Data Analysis

More information

Lecture 3: Data Principles

Lecture 3: Data Principles Lecture 3: Data Principles Information Visualization CPSC 533C, Fall 2011 Tamara Munzner UBC Computer Science Mon, 19 September 2011 1 / 33 Papers Covered Chapter 2: Data Principles Polaris: A System for

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Data Mining: Exploring Data. Lecture Notes for Chapter 3 Data Mining: Exploring Data Lecture Notes for Chapter 3 1 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.

More information

MATH5745 Multivariate Methods Lecture 13

MATH5745 Multivariate Methods Lecture 13 MATH5745 Multivariate Methods Lecture 13 April 24, 2018 MATH5745 Multivariate Methods Lecture 13 April 24, 2018 1 / 33 Cluster analysis. Example: Fisher iris data Fisher (1936) 1 iris data consists of

More information

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Data Exploration Chapter Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 What is data exploration?

More information

Scagnostics Distributions

Scagnostics Distributions Scagnostics Distributions Leland WILKINSON and Graham WILLS Scagnostics is a Tukey neologism for the term scatterplot diagnostics. Scagnostics are characterizations of the 2D distributions of orthogonal

More information

CS 584 Data Mining. Classification 1

CS 584 Data Mining. Classification 1 CS 584 Data Mining Classification 1 Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes is the class. Find a model for

More information

BL5229: Data Analysis with Matlab Lab: Learning: Clustering

BL5229: Data Analysis with Matlab Lab: Learning: Clustering BL5229: Data Analysis with Matlab Lab: Learning: Clustering The following hands-on exercises were designed to teach you step by step how to perform and understand various clustering algorithm. We will

More information

Cluster Analysis and Visualization. Workshop on Statistics and Machine Learning 2004/2/6

Cluster Analysis and Visualization. Workshop on Statistics and Machine Learning 2004/2/6 Cluster Analysis and Visualization Workshop on Statistics and Machine Learning 2004/2/6 Outlines Introduction Stages in Clustering Clustering Analysis and Visualization One/two-dimensional Data Histogram,

More information

k-nearest Neighbors + Model Selection

k-nearest Neighbors + Model Selection 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University k-nearest Neighbors + Model Selection Matt Gormley Lecture 5 Jan. 30, 2019 1 Reminders

More information

DATA MINING INTRODUCTION TO CLASSIFICATION USING LINEAR CLASSIFIERS

DATA MINING INTRODUCTION TO CLASSIFICATION USING LINEAR CLASSIFIERS DATA MINING INTRODUCTION TO CLASSIFICATION USING LINEAR CLASSIFIERS 1 Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes and a class attribute

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Model Selection Introduction to Machine Learning. Matt Gormley Lecture 4 January 29, 2018

Model Selection Introduction to Machine Learning. Matt Gormley Lecture 4 January 29, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Model Selection Matt Gormley Lecture 4 January 29, 2018 1 Q&A Q: How do we deal

More information

Hsiaochun Hsu Date: 12/12/15. Support Vector Machine With Data Reduction

Hsiaochun Hsu Date: 12/12/15. Support Vector Machine With Data Reduction Support Vector Machine With Data Reduction 1 Table of Contents Summary... 3 1. Introduction of Support Vector Machines... 3 1.1 Brief Introduction of Support Vector Machines... 3 1.2 SVM Simple Experiment...

More information

University of Florida CISE department Gator Engineering. Visualization

University of Florida CISE department Gator Engineering. Visualization Visualization Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida What is visualization? Visualization is the process of converting data (information) in to

More information

An L Norm Visual Classifier

An L Norm Visual Classifier An L Norm Visual Classifier Anushka Anand University of Illinois at Chicago Chicago, IL aanand2@lac.uic.edu Leland Wilkinson SYSTAT Inc. Chicago, IL leland.wilkinson@systat.com Dang Nhon Tuan University

More information

CISC 4631 Data Mining

CISC 4631 Data Mining CISC 4631 Data Mining Lecture 03: Introduction to classification Linear classifier Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside) 1 Classification:

More information

GRAPHICAL ALGORITHMS. UNIT _II Lecture-12 Slides No. 3-7 Lecture Slides No Lecture Slides No

GRAPHICAL ALGORITHMS. UNIT _II Lecture-12 Slides No. 3-7 Lecture Slides No Lecture Slides No GRAPHICAL ALGORITHMS UNIT _II Lecture-12 Slides No. 3-7 Lecture-13-16 Slides No. 8-26 Lecture-17-19 Slides No. 27-42 Topics Covered Graphs & Trees ( Some Basic Terminologies) Spanning Trees (BFS & DFS)

More information

Input: Concepts, Instances, Attributes

Input: Concepts, Instances, Attributes Input: Concepts, Instances, Attributes 1 Terminology Components of the input: Concepts: kinds of things that can be learned aim: intelligible and operational concept description Instances: the individual,

More information

Introduction to Artificial Intelligence

Introduction to Artificial Intelligence Introduction to Artificial Intelligence COMP307 Machine Learning 2: 3-K Techniques Yi Mei yi.mei@ecs.vuw.ac.nz 1 Outline K-Nearest Neighbour method Classification (Supervised learning) Basic NN (1-NN)

More information

刘淇 School of Computer Science and Technology USTC

刘淇 School of Computer Science and Technology USTC Data Exploration 刘淇 School of Computer Science and Technology USTC http://staff.ustc.edu.cn/~qiliuql/dm2013.html t t / l/dm2013 l What is data exploration? A preliminary exploration of the data to better

More information

Data Mining: Exploring Data

Data Mining: Exploring Data Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar But we start with a brief discussion of the Friedman article and the relationship between Data

More information

Analytical Techniques for Anomaly Detection Through Features, Signal-Noise Separation and Partial-Value Association

Analytical Techniques for Anomaly Detection Through Features, Signal-Noise Separation and Partial-Value Association Proceedings of Machine Learning Research 77:20 32, 2017 KDD 2017: Workshop on Anomaly Detection in Finance Analytical Techniques for Anomaly Detection Through Features, Signal-Noise Separation and Partial-Value

More information

Orange3 Educational Add-on Documentation

Orange3 Educational Add-on Documentation Orange3 Educational Add-on Documentation Release 0.1 Biolab Jun 01, 2018 Contents 1 Widgets 3 2 Indices and tables 27 i ii Widgets in Educational Add-on demonstrate several key data mining and machine

More information

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2016

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2016 CPSC 340: Machine Learning and Data Mining Outlier Detection Fall 2016 Admin Assignment 1 solutions will be posted after class. Assignment 2 is out: Due next Friday, but start early! Calculus and linear

More information

An Introduction to Cluster Analysis. Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs

An Introduction to Cluster Analysis. Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs An Introduction to Cluster Analysis Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs zhaoxia@ics.uci.edu 1 What can you say about the figure? signal C 0.0 0.5 1.0 1500 subjects Two

More information

KTH ROYAL INSTITUTE OF TECHNOLOGY. Lecture 14 Machine Learning. K-means, knn

KTH ROYAL INSTITUTE OF TECHNOLOGY. Lecture 14 Machine Learning. K-means, knn KTH ROYAL INSTITUTE OF TECHNOLOGY Lecture 14 Machine Learning. K-means, knn Contents K-means clustering K-Nearest Neighbour Power Systems Analysis An automated learning approach Understanding states in

More information

Knowledge Discovery and Data Mining I

Knowledge Discovery and Data Mining I Ludwig-Maximilians-Universität München Lehrstuhl für Datenbanksysteme und Data Mining Prof. Dr. Thomas Seidl Knowledge Discovery and Data Mining I Winter Semester 8/9 Agenda. Introduction. Basics. Data

More information

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York A Systematic Overview of Data Mining Algorithms Sargur Srihari University at Buffalo The State University of New York 1 Topics Data Mining Algorithm Definition Example of CART Classification Iris, Wine

More information

Course on Microarray Gene Expression Analysis

Course on Microarray Gene Expression Analysis Course on Microarray Gene Expression Analysis ::: Normalization methods and data preprocessing Madrid, April 27th, 2011. Gonzalo Gómez ggomez@cnio.es Bioinformatics Unit CNIO ::: Introduction. The probe-level

More information

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018 CPSC 340: Machine Learning and Data Mining Outlier Detection Fall 2018 Admin Assignment 2 is due Friday. Assignment 1 grades available? Midterm rooms are now booked. October 18 th at 6:30pm (BUCH A102

More information

Hands on Datamining & Machine Learning with Weka

Hands on Datamining & Machine Learning with Weka Step1: Click the Experimenter button to launch the Weka Experimenter. The Weka Experimenter allows you to design your own experiments of running algorithms on datasets, run the experiments and analyze

More information

Out of Bundle Vodafone

Out of Bundle Vodafone Out of Bundle Vodafone Out of Bundle Charges - Vodafone The following charges apply once you have used your allowance, or if you call/text/browse outside of your allowance: Standard UK call charges Cost

More information

Statistical Methods in AI

Statistical Methods in AI Statistical Methods in AI Distance Based and Linear Classifiers Shrenik Lad, 200901097 INTRODUCTION : The aim of the project was to understand different types of classification algorithms by implementing

More information

CHIRP: A new classifier based on Composite Hypercubes on Iterated Random Projections

CHIRP: A new classifier based on Composite Hypercubes on Iterated Random Projections CHIRP: A new classifier based on Composite Hypercubes on Iterated Random Projections Leland Wilkinson Anushka Anand Department of Computer Department of Computer Science Science University of Illinois

More information

A review of data complexity measures and their applicability to pattern classification problems. J. M. Sotoca, J. S. Sánchez, R. A.

A review of data complexity measures and their applicability to pattern classification problems. J. M. Sotoca, J. S. Sánchez, R. A. A review of data complexity measures and their applicability to pattern classification problems J. M. Sotoca, J. S. Sánchez, R. A. Mollineda Dept. Llenguatges i Sistemes Informàtics Universitat Jaume I

More information

Basic Concepts Weka Workbench and its terminology

Basic Concepts Weka Workbench and its terminology Changelog: 14 Oct, 30 Oct Basic Concepts Weka Workbench and its terminology Lecture Part Outline Concepts, instances, attributes How to prepare the input: ARFF, attributes, missing values, getting to know

More information

Machine Learning: Algorithms and Applications Mockup Examination

Machine Learning: Algorithms and Applications Mockup Examination Machine Learning: Algorithms and Applications Mockup Examination 14 May 2012 FIRST NAME STUDENT NUMBER LAST NAME SIGNATURE Instructions for students Write First Name, Last Name, Student Number and Signature

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms

Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California,

More information

Features: representation, normalization, selection. Chapter e-9

Features: representation, normalization, selection. Chapter e-9 Features: representation, normalization, selection Chapter e-9 1 Features Distinguish between instances (e.g. an image that you need to classify), and the features you create for an instance. Features

More information

Data Mining - Data. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - Data 1 / 47

Data Mining - Data. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - Data 1 / 47 Data Mining - Data Dr. Jean-Michel RICHER 2018 jean-michel.richer@univ-angers.fr Dr. Jean-Michel RICHER Data Mining - Data 1 / 47 Outline 1. Introduction 2. Data preprocessing 3. CPA with R 4. Exercise

More information

Video Cassette Player

Video Cassette Player 3-861-061-12 (1) Video Cassette Player Operating Instructions Before operating the unit, please read this manual thoroughly, and retain it for future reference. GV-F700 1997 by Sony Corporation Operations

More information

Data Exploration and Preparation Data Mining and Text Mining (UIC Politecnico di Milano)

Data Exploration and Preparation Data Mining and Text Mining (UIC Politecnico di Milano) Data Exploration and Preparation Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining, : Concepts and Techniques", The Morgan Kaufmann

More information

Chapter 1, TUFTE STYLE GRIDDING FOR READABILITY. Chapter 5, SLICE (CROSS-SECTIONAL VIEWS)

Chapter 1, TUFTE STYLE GRIDDING FOR READABILITY. Chapter 5, SLICE (CROSS-SECTIONAL VIEWS) Chapter, TUFTE STYLE GRIDDING FOR READABILITY Chapter 5, SLICE (CROSS-SECTIONAL VIEWS) Number of responses 8 7 6 5 4 3 2 9 8 7 6 5 4 3 2 Distribution of ethnicities in each income group of SF bay area

More information

Practical Machine Learning Agenda

Practical Machine Learning Agenda Practical Machine Learning Agenda Starting From Log Management Moving To Machine Learning PunchPlatform team Thales Challenges Thanks 1 Starting From Log Management 2 Starting From Log Management Data

More information

Boxplot

Boxplot Boxplot By: Meaghan Petix, Samia Porto & Franco Porto A boxplot is a convenient way of graphically depicting groups of numerical data through their five number summaries: the smallest observation (sample

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

Transforming Scagnostics to Reveal Hidden Features

Transforming Scagnostics to Reveal Hidden Features 1624 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 20, NO. 12, DECEMBER 2014 Transforming Scagnostics to Reveal Hidden Features Tuan Nhon Dang, Leland Wilkinson Fig. 1. Scatterplots derived

More information

Package Modeler. May 18, 2018

Package Modeler. May 18, 2018 Version 3.4.3 Date 2018-05-17 Package Modeler May 18, 2018 Title Classes and Methods for Training and Using Binary Prediction Models Author Kevin R. Coombes Maintainer Kevin R. Coombes

More information

Data analysis case study using R for readily available data set using any one machine learning Algorithm

Data analysis case study using R for readily available data set using any one machine learning Algorithm Assignment-4 Data analysis case study using R for readily available data set using any one machine learning Algorithm Broadly, there are 3 types of Machine Learning Algorithms.. 1. Supervised Learning

More information

Some Open Problems in Graph Theory and Computational Geometry

Some Open Problems in Graph Theory and Computational Geometry Some Open Problems in Graph Theory and Computational Geometry David Eppstein Univ. of California, Irvine Dept. of Information and Computer Science ICS 269, January 25, 2002 Two Models of Algorithms Research

More information

Course: Geometry PAP Prosper ISD Course Map Grade Level: Estimated Time Frame 6-7 Block Days. Unit Title

Course: Geometry PAP Prosper ISD Course Map Grade Level: Estimated Time Frame 6-7 Block Days. Unit Title Unit Title Unit 1: Geometric Structure Estimated Time Frame 6-7 Block 1 st 9 weeks Description of What Students will Focus on on the terms and statements that are the basis for geometry. able to use terms

More information

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools Abstract In this project, we study 374 public high schools in New York City. The project seeks to use regression techniques

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

Medical Image Analysis

Medical Image Analysis Medical Image Analysis Instructor: Moo K. Chung mchung@stat.wisc.edu Lecture 10. Multiple Comparisons March 06, 2007 This lecture will show you how to construct P-value maps fmri Multiple Comparisons 4-Dimensional

More information

Package EMDomics. November 11, 2018

Package EMDomics. November 11, 2018 Package EMDomics November 11, 2018 Title Earth Mover's Distance for Differential Analysis of Genomics Data The EMDomics algorithm is used to perform a supervised multi-class analysis to measure the magnitude

More information

Investigating Country Differences in Mobile App User Behaviour and Challenges for Software Engineering. Soo Ling Lim

Investigating Country Differences in Mobile App User Behaviour and Challenges for Software Engineering. Soo Ling Lim Investigating Country Differences in Mobile App User Behaviour and Challenges for Software Engineering Soo Ling Lim Analysis of app store data reveals what users do in the app store. We want to know why

More information

Experimental Design + k- Nearest Neighbors

Experimental Design + k- Nearest Neighbors 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Experimental Design + k- Nearest Neighbors KNN Readings: Mitchell 8.2 HTF 13.3

More information

Exploring Data data exploration Exploratory Data Analysis

Exploring Data data exploration Exploratory Data Analysis 3 Exploring Data The previous chapter addressed high-level data issues that are important in the knowledge discovery process This chapter provides an introduction to data exploration, which is a preliminary

More information

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1 Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches

More information

Access Code and Phone Number

Access Code and Phone Number Algeria Dial International collect/reverse charge number: 1-212-559-5842 Argentina For phones using Telecom: Dial 0-800-555-4288; wait for prompt, then dial 866- For phones using Telefonica: Dial 0-800-222-1288;

More information

SVM Classification in -Arrays

SVM Classification in -Arrays SVM Classification in -Arrays SVM classification and validation of cancer tissue samples using microarray expression data Furey et al, 2000 Special Topics in Bioinformatics, SS10 A. Regl, 7055213 What

More information

Mathematics Scope & Sequence Grade 8 Revised: June 2015

Mathematics Scope & Sequence Grade 8 Revised: June 2015 Mathematics Scope & Sequence 2015-16 Grade 8 Revised: June 2015 Readiness Standard(s) First Six Weeks (29 ) 8.2D Order a set of real numbers arising from mathematical and real-world contexts Convert between

More information

Curriculum Connections (Fractions): K-8 found at under Planning Supports

Curriculum Connections (Fractions): K-8 found at   under Planning Supports Curriculum Connections (Fractions): K-8 found at http://www.edugains.ca/newsite/digitalpapers/fractions/resources.html under Planning Supports Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade

More information

Analysis and Latent Semantic Indexing

Analysis and Latent Semantic Indexing 18 Principal Component Analysis and Latent Semantic Indexing Understand the basics of principal component analysis and latent semantic index- Lab Objective: ing. Principal Component Analysis Understanding

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Dashboard. Jan 13, Jan 8, 2012 Comparing to: Site. 12,742 Visits % Bounce Rate. 00:05:26 Avg. Time on Site.

Dashboard. Jan 13, Jan 8, 2012 Comparing to: Site. 12,742 Visits % Bounce Rate. 00:05:26 Avg. Time on Site. Dashboard 3 3 15 15 Jan 17 Feb 18 Mar 22 Apr 23 May 25 Jun 26 Jul 28 Aug 29 Sep 3 Nov 1 Dec 3 Ja Site Usage 12,742 4.3% Bounce Rate 39,496 Pageviews :5:26 Avg. Time on Site 3.1 Pages/Visit 61.73% % New

More information

Brookhaven School District Pacing Guide Fourth Grade Math

Brookhaven School District Pacing Guide Fourth Grade Math Brookhaven School District Pacing Guide Fourth Grade Math 11 days Aug. 7- Aug.21 1 st NINE WEEKS Chapter 1- Place value, reading, writing and comparing whole numbers, rounding, addition and subtraction

More information

We have solid distributors in over 120 countries China China China Healthcare Equipment Medical Equipment Healthcare Equipment

We have solid distributors in over 120 countries China China China Healthcare Equipment Medical Equipment Healthcare Equipment Mali Ireland United Kingdom Holland Belgium Ukraine Belgium Germany Czech Republic Austria Slovakia France Hungary Moldova Switzerland Slovenia Albania Romania Spain Bulgaria Portugal Italy Montenegro

More information

CHIRP: A new classifier based on Composite Hypercubes on Iterated Random Projections

CHIRP: A new classifier based on Composite Hypercubes on Iterated Random Projections CHIRP: A new classifier based on Composite Hypercubes on Iterated Random Projections Leland Wilkinson Department of Computer Science University of Illinois at Chicago Chicago, IL 60606, USA leland.wilkinson@systat.com

More information

Minimum-Spanning-Tree problem. Minimum Spanning Trees (Forests) Minimum-Spanning-Tree problem

Minimum-Spanning-Tree problem. Minimum Spanning Trees (Forests) Minimum-Spanning-Tree problem Minimum Spanning Trees (Forests) Given an undirected graph G=(V,E) with each edge e having a weight w(e) : Find a subgraph T of G of minimum total weight s.t. every pair of vertices connected in G are

More information

k Nearest Neighbors Super simple idea! Instance-based learning as opposed to model-based (no pre-processing)

k Nearest Neighbors Super simple idea! Instance-based learning as opposed to model-based (no pre-processing) k Nearest Neighbors k Nearest Neighbors To classify an observation: Look at the labels of some number, say k, of neighboring observations. The observation is then classified based on its nearest neighbors

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

Summary. Machine Learning: Introduction. Marcin Sydow

Summary. Machine Learning: Introduction. Marcin Sydow Outline of this Lecture Data Motivation for Data Mining and Learning Idea of Learning Decision Table: Cases and Attributes Supervised and Unsupervised Learning Classication and Regression Examples Data:

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

Middle School Math Course 3 Correlation of the ALEKS course Middle School Math 3 to the Illinois Assessment Framework for Grade 8

Middle School Math Course 3 Correlation of the ALEKS course Middle School Math 3 to the Illinois Assessment Framework for Grade 8 Middle School Math Course 3 Correlation of the ALEKS course Middle School Math 3 to the Illinois Assessment Framework for Grade 8 State Goal 6: Number Sense 6.8.01: 6.8.02: 6.8.03: 6.8.04: 6.8.05: = ALEKS

More information

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof. CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing University of Florida, CISE Department Prof. Daisy Zhe Wang Data Visualization Value of Visualization Data And Image Models

More information

Cisco Aironet In-Building Wireless Solutions International Power Compliance Chart

Cisco Aironet In-Building Wireless Solutions International Power Compliance Chart Cisco Aironet In-Building Wireless Solutions International Power Compliance Chart ADDITIONAL INFORMATION It is important to Cisco Systems that its resellers comply with and recognize all applicable regulations

More information

Supplementary Material

Supplementary Material Supplementary Material Figure 1S: Scree plot of the 400 dimensional data. The Figure shows the 20 largest eigenvalues of the (normalized) correlation matrix sorted in decreasing order; the insert shows

More information

MATHEMATICS Grade 7 Advanced Standard: Number, Number Sense and Operations

MATHEMATICS Grade 7 Advanced Standard: Number, Number Sense and Operations Standard: Number, Number Sense and Operations Number and Number Systems A. Use scientific notation to express large numbers and numbers less than one. 1. Use scientific notation to express large numbers

More information

Data Mining of Range-Based Classification Rules for Data Characterization

Data Mining of Range-Based Classification Rules for Data Characterization Data Mining of Range-Based Classification Rules for Data Characterization Achilleas Tziatzios A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer

More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information

EPL451: Data Mining on the Web Lab 10

EPL451: Data Mining on the Web Lab 10 EPL451: Data Mining on the Web Lab 10 Παύλος Αντωνίου Γραφείο: B109, ΘΕΕ01 University of Cyprus Department of Computer Science Dimensionality Reduction Map points in high-dimensional (high-feature) space

More information

PITSCO Math Individualized Prescriptive Lessons (IPLs)

PITSCO Math Individualized Prescriptive Lessons (IPLs) Orientation Integers 10-10 Orientation I 20-10 Speaking Math Define common math vocabulary. Explore the four basic operations and their solutions. Form equations and expressions. 20-20 Place Value Define

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

A Keypoint Descriptor Inspired by Retinal Computation

A Keypoint Descriptor Inspired by Retinal Computation A Keypoint Descriptor Inspired by Retinal Computation Bongsoo Suh, Sungjoon Choi, Han Lee Stanford University {bssuh,sungjoonchoi,hanlee}@stanford.edu Abstract. The main goal of our project is to implement

More information

Introduction to R and Statistical Data Analysis

Introduction to R and Statistical Data Analysis Microarray Center Introduction to R and Statistical Data Analysis PART II Petr Nazarov petr.nazarov@crp-sante.lu 22-11-2010 OUTLINE PART II Descriptive statistics in R (8) sum, mean, median, sd, var, cor,

More information

Topic Code Indicators Srategies

Topic Code Indicators Srategies Angle Relationships, Parallel Lines Angle Relationships, Interior/Exterior angles 8G2 8M8 Recognize the angles formed and the relationship between the angles when two lines intersect and when parallel

More information

Visual Computing. Lecture 2 Visualization, Data, and Process

Visual Computing. Lecture 2 Visualization, Data, and Process Visual Computing Lecture 2 Visualization, Data, and Process Pipeline 1 High Level Visualization Process 1. 2. 3. 4. 5. Data Modeling Data Selection Data to Visual Mappings Scene Parameter Settings (View

More information

Integrated Math I. IM1.1.3 Understand and use the distributive, associative, and commutative properties.

Integrated Math I. IM1.1.3 Understand and use the distributive, associative, and commutative properties. Standard 1: Number Sense and Computation Students simplify and compare expressions. They use rational exponents and simplify square roots. IM1.1.1 Compare real number expressions. IM1.1.2 Simplify square

More information

Center for the Management of Information Technology March 9, 2012

Center for the Management of Information Technology March 9, 2012 Big Data @ comscore Turning Billions of Rows into Market Insight Michael Brown March 9 th, 2012 comscore is a Global Leader in Measuring the Digital World NASDAQ Clients SCOR Employees 1000+ Headquarters

More information

Computational Geometry. Algorithm Design (10) Computational Geometry. Convex Hull. Areas in Computational Geometry

Computational Geometry. Algorithm Design (10) Computational Geometry. Convex Hull. Areas in Computational Geometry Computational Geometry Algorithm Design (10) Computational Geometry Graduate School of Engineering Takashi Chikayama Algorithms formulated as geometry problems Broad application areas Computer Graphics,

More information