Visually-Motivated Characterizations of Point Sets Embedded in High-Dimensional Geometric Spaces: Current Projects

Size: px

Start display at page:

Download "Visually-Motivated Characterizations of Point Sets Embedded in High-Dimensional Geometric Spaces: Current Projects"

Rosa Johnson
5 years ago
Views:

1 Visually-Motivated Characterizations of Point Sets Embedded in High-Dimensional Geometric Spaces: Current Projects Leland Wilkinson SYSTAT University of Illinois at Chicago (Computer Science) Northwestern University (Statistics) Robert Grossman University of Illinois at Chicago (Computer Science) Adilson Motter Northwestern University (Physics) 1

2 Subprojects Scagnostics Classification Venn/Euler Diagrams Treemaps 2

3 Scagnostics Wilkinson, Anand, and Grossman (2006) characterize a scatterplot (2D point set) with nine measures. We base our measures on three geometric graphs. Convex Hull Alpha Shape Minimum Spanning Tree 3

4 Each geometric graph is a subset of the Delaunay Triangulation 4

5 Convex Hull 5

6 Alpha Shape 6

7 Minimum Spanning Tree 7

8 Shape Convex: area of alpha shape divided by area of convex hull Skinny: ratio of perimeter to area of the alpha shape Stringy: ratio of 2-degree vertices in MST to number of vertices > 1-degree 8

9 Trend Monotonic: squared Spearman correlation coefficient 9

10 Density Skewed: ratio of (Q90 - Q50) / (Q90 - Q10), where quantiles are on MST edge lengths Clumpy: 1 minus the ratio of the longest edge in the largest runt (blue) to the length of runt-cutting edge (red) longest edge in runt largest runt Outlying: proportion of total MST length due to edges adjacent to outliers 10

11 Density Sparse: 90th percentile of distribution of edge lengths in MST Striated: proportion of all vertices in the MST that are degree-2 and have a cosine between adjacent edges less than

12 Software (Wilkinson and Anand) 12

13 Fishing Expeditions in Visual Analytics We have used the empirical distribution of Scagnostics (Wilkinson and Wills, JCGS, 2008), the False Discovery Rate (FDR) statistic, and automated statistical modeling to develop algorithms for reducing false discoveries when people use visual-analytic software (Wilkinson and Wills, IVS, 2009). 40 Mean Elevation Temp(F) Jan Feb Mar Apr May Jun Jul Aug Sep Oct Day Temperature Day These data are fake. They are a random walk produced by a random number generator. These data are real. They are temperature measurements of a cow over 80-days. The data are periodic. 13

14 Distribution of Scagnostics Monotonic Stringy Skinny Convex Striated Sparse Clumpy Skewed Outlying Measure 14

15 False Discovery Rate (FDR) Benjamini and Hochberg, JRSS, 1995 Sort a list of p values and define p0 = 0 P = [p 1,..., p m ] Compute BH index Use adjusted p value (indexed by rbh) as cutoff Based on uniform distribution of p values, so tests may be heterogeneous 15

16 Scagnostics on Graphs Adilson Motter is developing scagnostics for (V, E) graphs. This effort is similar to what we did for scatterplots. One wants a relatively small set of measures that are relatively independent and that nicely characterize a wide universe of real directed and undirected graphs. The goal is to classify real graphs, recognize anomalies, and locate exemplars in subgraphs of large graphs. 16

17 Classification Based on our analysis of how people visually classify 2D point sets, we have developed a high-dimensional, linear complexity, supervised classifier called Linf. This classifier, using the L-infinity metric, rivals or outperforms leading classifiers on 10 difficult datasets. 17

18 Visual Classification Anand, Wilkinson, and Tuan (ICDM, 2009) developed a visual classifier to analyze how people distinguish target point sets from background point sets. We designed the software to behave like a video game. 18

19 Description Regions Weighted L-infinity norm x = sup(w 1 x 1, w 2 x 2,... w n x n ) Hypercube Description Region (HDR) The set of points less than a fixed distance from a single point using the L-infinity norm. Composite Hypercube Description Region (CHDR) Union of HDRs. 19

20 The Linf Algorithm 1.Transform t(x) = sgn(x)sqrt(abs(x)) 2.Project 3.Bin Pick next target class (one against all). Compute 25 random, integer-weighted, 2D projections. Pick best 5 projections on a class-separation measure. b = 2log 2 (n) Compute purity of target class instances in each bin. 4.Cover j j j j i 5.Repeat Store best cover in scoring list. Repeat steps 2-4 until no training instances left. 20

21 Results waveform S R L B T B R S L T vowel R T SB L B T R S L Dataset spect shuttle satellite orange10 optdigits L R TL SB T R L B S R LB T T SRL B R S B T S R T S B B B B T S L B S T R L R S T R T R S L L L Naive Bayes Support Vector Machine Classification Tree Random Forest Linf madelon T L B S R B R T L S Standardized Error cancer L S T R B BR T S L adult SR T LB B T R L S B : Naive Bayes L : Linf R : Random Forest S : Support Vector Machine T : Classification Tree Error (%) Time (sec) 21

22 Venn/Euler Diagrams How do we fit a Venn/Euler diagram to empirical data involving more than 3 sets? These diagrams are widely used in the bioinformatics community. We have developed an R function called venneuler() and have posted it on CRAN for general use. Euler diagram for 11 animals based on gene lists from the Agilent DNA oligo microarray database. The analysis was based on 247,412 gene symbols and 11 animal names. The stress for this solution is.06, with corresponding correlation of 0.97 between circle and intersection areas and counts of genes. The computation was completed in under 30 seconds on a 2.5 GHz MacBook Pro running the Java 1.5 Virtual Machine in 2GB of allocated memory. Mouse Rat Frog Pig Sheep Rabbit Horse Human Chicken Cow Dog 22

23 The venneuler() Algorithm 1.Make list of m intersections among n sets (m = 2 n ) 2.Make list of disk intersections (size disks proportional to Xi ). 3.Make list of disjoint counts and list of disjoint areas. 4.Estimate model. 5.Move disks and repeat 2-5 to minimize error, using gradient: 23

24 Calculating Areas

25 Treemaps We are investigating the ordering of treemaps. 25

Seriating a Tree Canada Italy France Sweden WGermany Belgium Denmark UK Austria Switzerland Czechoslov Poland Cuba Chile Argentina Uruguay Panama Colombia Peru Brazil Ecuador ElSalvador Guatemala

26 Seriating a Tree Canada Italy France Sweden WGermany Belgium Denmark UK Austria Switzerland Czechoslov Poland Cuba Chile Argentina Uruguay Panama Colombia Peru Brazil Ecuador ElSalvador Guatemala Iraq Libya Ethiopia Haiti Somalia Afghanistan Guinea Permuted Data Matrix URBAN LITERACY LIFEEXPF LIFEEXPM GDP_CAP HEALTH EDUC MIL DEATH_RT BABYMORT BIRTH_RT B_TO_D Wilkinson and Friendly, TAS,

27 Fisher-Anderson Iris Data Petal Length + Petal Width Sepal Length + Sepal Width Setosa Versicolor Virginica Species X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X41 X42 X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60 X61 X62 X63 X64 X65 X66 X67 X68 X69 X70 X71 X72 X73 X74 X75 X76 X77 X78 X79 X80 X81 X82 X83 X84 X85 X86 X87 X88 X89 X90 X91 X92 X93 X94 X95 X96 X97 X98 X99 X100 X101 X103 X104 X105 X106 X107 X109 X110 X111 X112 X113 X114 X115 X116 X118 X119 X120 X122 X123 X124 X125 X126 X127 X128 X130 X132 X134 X135 X136 X137 X139 X140 X141 X142 X145 X146 X147 X148 X149 X150 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X41 X42 X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60 X61 X62 X63 X64 X65 X66 X67 X68 X69 X70 X71 X72 X73 X74 X75 X76 X77 X78 X79 X80 X81 X82 X83 X84 X85 X86 X87 X88 X89 X90 X91 X92 X93 X94 X95 X96 X97 X98 X99 X100 X100 X101 X103 X104 X105 X106 X107 X109 X110 X111 X112 X113 X114 X115 X116 X118 X119 X120 X122 X123 X124 X125 X126 X127 X128 X130 X132 X134 X135 X136 X137 X139 X140 X141 X142 X145 X146 X147 X148 X149 X150

FODAVA Partners Leland Wilkinson (SYSTAT & UIC) Robert Grossman (UIC) Adilson Motter (Northwestern) Anushka Anand, Troy Hernandez (UIC)

FODAVA Partners Leland Wilkinson (SYSTAT & UIC) Robert Grossman (UIC) Adilson Motter (Northwestern) Anushka Anand, Troy Hernandez (UIC) Visually-Motivated Characterizations of Point Sets Embedded in High-Dimensional