Tutorial 3. Chiun-How Kao 高君豪


1 Tutorial 3 Chiun-How Kao 高君豪 maokao@stat.sinica.edu.tw

2 Introduction; Generalized Association Plots (GAP): Presentation of Raw Data Matrix, Seriation of Proximity Matrices and Raw Data Matrix, Partitions of Permuted Matrix Maps, Sufficient Graph; Interval GAP (igap); Demo; Conclusion


3 Visualization as an EDA tool for assisting formal mathematical modeling
Exploratory Data Analysis (EDA; John Tukey, 1977): "It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it."
Allow the data to speak for themselves before standard assumptions or formal modeling.
Graphics-oriented tools: the box-and-whisker plot, the scatterplot, etc.

4 Generalized Association Plots (GAP): Presentation of Raw Data Matrix; Seriation of Proximity Matrices and Raw Data Matrix; Partitions of Permuted Matrix Maps; Sufficient Graph

5 Four Steps of Generalized Association Plots (GAP)
GAP (Chen, 1996, 1999, and 2002): integrated visualization of the raw data matrix and two proximity matrices, using seriation (R2E) and color representation.
[Figure: the four steps of GAP applied to a multivariate data set, with patients as rows and symptoms as columns]
(a) Raw data map and proximity maps with suitable color projection
(b) Sorted data map and proximity maps with principle of geometry
(c) Partitioned data map and proximity maps with near-stationary iterations
(d) Sufficient graphs with three linkages for a multivariate data set

6 The 1st Step of GAP: Presentation of Raw Data Matrix. Data Transformation; Selection of Proximity Measures; Color Spectrum; Display Conditions

7 Presentation of Raw Data Matrix

8 Display Conditions

9 Display Conditions
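A display condition fixes the range of values over which the color spectrum is stretched. The sketch below is only an illustration under stated assumptions: it assumes three conditions (whole matrix, per row, per column), and the helper name `scale_to_unit` and the toy data are hypothetical, not part of the GAP software. Each entry is mapped to [0, 1], which a color spectrum could then consume.

```python
def scale_to_unit(matrix, condition="matrix"):
    """Map every entry to [0, 1] relative to the range chosen by `condition`.

    condition = "matrix": one common range for the whole matrix
    condition = "row":    a separate range for each row
    condition = "column": a separate range for each column
    (The three condition names are assumptions made for this sketch.)
    """
    def rescale(values, lo, hi):
        span = hi - lo
        # A constant vector has no range; place it in the middle of the spectrum.
        return [0.5 if span == 0 else (v - lo) / span for v in values]

    if condition == "matrix":
        flat = [v for row in matrix for v in row]
        lo, hi = min(flat), max(flat)
        return [rescale(row, lo, hi) for row in matrix]
    if condition == "row":
        return [rescale(row, min(row), max(row)) for row in matrix]
    if condition == "column":
        columns = [list(col) for col in zip(*matrix)]
        scaled = [rescale(col, min(col), max(col)) for col in columns]
        return [list(row) for row in zip(*scaled)]
    raise ValueError("condition must be 'matrix', 'row', or 'column'")


# Hypothetical 2 x 2 data matrix whose second column has a much larger scale.
data = [[1, 10], [3, 30]]
by_matrix = scale_to_unit(data, "matrix")
by_row = scale_to_unit(data, "row")
by_column = scale_to_unit(data, "column")
```

Under the matrix condition the large-scale column dominates the spectrum; the row and column conditions remove that effect in one direction each, so the three displays answer different questions about the same data.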

10 The 2nd Step of GAP: Seriation of Proximity Matrices and Raw Data Matrix. Relativity of Statistical Graph; Global Criterion: Rank-Two Elliptical Seriation; Local Criterion: Tree Seriation; Flipping of Tree Intermediate Nodes; Evaluation of permutation algorithms: the Generalized Anti-Robinson (GAR) criterion

11 Hierarchical Clustering Tree (Kaufman and Rousseeuw, 1990). Example: Average-Linkage
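For readers who want to see the average-linkage rule spelled out, here is a minimal sketch in plain Python. It is a naive implementation for small distance matrices, not the code behind the slide; the function name `average_linkage` and the four-point example are hypothetical.

```python
def average_linkage(dist):
    """Agglomerative clustering with average linkage on a distance matrix.

    Returns the merges in order as (members_a, members_b, height), where
    height is the mean pairwise distance between the two merged clusters.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []

    def mean_distance(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                height = mean_distance(clusters[x], clusters[y])
                # Strict "<" keeps the first pair found when heights tie.
                if best is None or height < best[0]:
                    best = (height, x, y)
        height, x, y = best
        merges.append((clusters[x], clusters[y], height))
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return merges


# Hypothetical example: four points on a line at positions 0, 1, 5, 6.
positions = [0, 1, 5, 6]
dist = [[abs(p - q) for q in positions] for p in positions]
merges = average_linkage(dist)
```

The merge list defines the tree structure only; the left-to-right order of the leaves is still free, which is the subject of the node-flipping slides.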

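12 Flipping of Tree Intermediate Nodes
(a) Different seriations (orderings of terminal nodes, or leaves) generated from an identical tree structure, e.g. A B C D E; B A C E D; C E D B A.
A binary tree with n leaves has n-1 intermediate nodes, each of which can be flipped independently, so it admits 2^(n-1) orderings; for n = 5, 2^(5-1) = 16.
[Figure panels (b), (c): ideal model; 1 flip, 3 flips, 5 flips, many flips; Eisen et al. (1998)]
External and internal references for guiding the flipping mechanism.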
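The count of leaf orderings can be checked by brute force. The sketch below enumerates every ordering obtainable by flipping intermediate nodes of a binary tree written as nested pairs. The particular tree shape is an assumption chosen so that the orderings A B C D E, B A C E D and C E D B A are all reachable; the slide's actual tree is not recoverable from the text.

```python
def leaf_orderings(tree):
    """Return every leaf ordering reachable by flipping intermediate nodes.

    A tree is either a leaf label or a pair (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        return [[tree]]
    left, right = tree
    orderings = []
    for left_order in leaf_orderings(left):
        for right_order in leaf_orderings(right):
            orderings.append(left_order + right_order)  # node as drawn
            orderings.append(right_order + left_order)  # node flipped
    return orderings


# Assumed tree shape: ((A, B), (C, (D, E))), i.e. 5 leaves and 4 intermediate nodes.
tree = (("A", "B"), ("C", ("D", "E")))
orderings = leaf_orderings(tree)
distinct = {tuple(order) for order in orderings}
```

Because the number of orderings doubles with every extra leaf, exhaustive search is only feasible for tiny trees, which is why the flipping has to be guided by an external or internal reference.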

13 Flipping of Tree Intermediate Nodes
[Figure: expression maps and correlation maps under three seriations: hierarchical clustering tree (HCT), rank-two elliptical (R2E) seriation, and tree seriation guided by R2E (HCT + R2E); panels: (a) Expression, (b) to (e) Correlation]

14 Seriation and Robinson Matrix A square similarity matrix is called a Robinson matrix if the highest entries within each row and column are on the main diagonal and if, when moving away from this diagonal, the entries never increase.
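The definition translates directly into a check. A minimal sketch follows, with the hypothetical helper name `is_robinson`: it walks away from the diagonal in both directions along every row and every column and reports whether any entry increases.

```python
def is_robinson(sim):
    """Check the Robinson property of a square similarity matrix.

    Moving away from the main diagonal along any row or column,
    the entries must never increase.
    """
    n = len(sim)
    transposed = [list(col) for col in zip(*sim)]
    for matrix in (sim, transposed):  # rows first, then columns
        for i in range(n):
            for j in range(i, n - 1):  # walk right from the diagonal
                if matrix[i][j + 1] > matrix[i][j]:
                    return False
            for j in range(i, 0, -1):  # walk left from the diagonal
                if matrix[i][j - 1] > matrix[i][j]:
                    return False
    return True


# Hypothetical 3 x 3 similarity matrix in a good order ...
ordered = [[3, 2, 1],
           [2, 3, 2],
           [1, 2, 3]]
# ... and the same matrix with objects 2 and 3 swapped.
swapped = [[3, 1, 2],
           [1, 3, 2],
           [2, 2, 3]]
```

The same objects can pass or fail the check depending on their order, so seriation can be read as a search for a permutation that brings the matrix as close to Robinson form as possible.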

15 Evaluation of permutation algorithms: the Generalized Anti-Robinson (GAR) criterion

(a) Anti-Robinson (AR) criterion:
$$\mathrm{AR} = \sum_{i=1}^{n} \Big[ \sum_{j<k<i} I(d_{ij} < d_{ik}) + \sum_{i<j<k} I(d_{ij} > d_{ik}) \Big]$$

(b) Generalized Anti-Robinson (GAR) criterion with window width w:
$$\mathrm{GAR}(w) = \sum_{i=1}^{n} \Big[ \sum_{(i-w) \le j<k<i} I(d_{ij} < d_{ik}) + \sum_{i<j<k \le (i+w)} I(d_{ij} > d_{ik}) \Big]$$
w = 2 (local), 3, ..., n-1 (global)

(c) Relative GAR (RGAR):
$$\mathrm{RGAR}(w) = \frac{\sum_{i=1}^{n} \Big[ \sum_{(i-w) \le j<k<i} I(d_{ij} < d_{ik}) + \sum_{i<j<k \le (i+w)} I(d_{ij} > d_{ik}) \Big]}{\sum_{i=1}^{n} \Big[ \sum_{(i-w) \le j<k<i} 1 + \sum_{i<j<k \le (i+w)} 1 \Big]}$$
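A small sketch may help make the criterion concrete. For each row i of a permuted distance matrix it counts anti-Robinson violations among the entries within w positions of the diagonal: on the left side a violation is d_ij < d_ik for j < k < i, on the right side d_ij > d_ik for i < j < k. Dividing by the number of comparisons gives the relative version. The function names and the example matrix are hypothetical, and indices are 0-based here.

```python
def gar_counts(d, w):
    """Return (violations, comparisons) for the GAR criterion with window w.

    With w = n - 1 the window covers the whole matrix, which gives the
    global anti-Robinson (AR) count.
    """
    n = len(d)
    violations = 0
    comparisons = 0
    for i in range(n):
        lo = max(0, i - w)
        hi = min(n - 1, i + w)
        # Left of the diagonal: distances should not grow toward the diagonal.
        for j in range(lo, i):
            for k in range(j + 1, i):
                comparisons += 1
                if d[i][j] < d[i][k]:
                    violations += 1
        # Right of the diagonal: distances should not shrink away from it.
        for j in range(i + 1, hi + 1):
            for k in range(j + 1, hi + 1):
                comparisons += 1
                if d[i][j] > d[i][k]:
                    violations += 1
    return violations, comparisons


def rgar(d, w):
    """Relative GAR: the share of comparisons that are violations."""
    violations, comparisons = gar_counts(d, w)
    return violations / comparisons if comparisons else 0.0


def permute(d, order):
    """Reorder rows and columns of d by the same permutation."""
    return [[d[a][b] for b in order] for a in order]


# Hypothetical example: five points on a line, first in the ideal order,
# then with the second and third objects swapped.
ideal = [[abs(i - j) for j in range(5)] for i in range(5)]
shuffled = permute(ideal, [0, 2, 1, 3, 4])
```

A lower value means the permutation is closer to an anti-Robinson pattern; a small window measures local smoothness, the widest window measures global structure.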

16 The 3rd Step of GAP: Partitions of Permuted Matrix Maps. The 4th Step of GAP: Sufficient Graph

17 Sufficient Graph

18 Generalization and Flexibility

19 Interval GAP (igap)
Kao, C. H., Nakano, J., Shieh, S. H., Tien, Y. J., Wu, H. M., Yang, C. K., and Chen, C. H.* (2014). Exploratory data analysis of interval-valued symbolic data with matrix visualization. Computational Statistics and Data Analysis, 79.
Introduction; Presentation of the raw data matrices; Example

20 Classical data: individuals, each taking a single value (a single player), e.g. age = 25, eye color = blue.
Symbolic data: symbolic units (groups/classes), e.g. a team.
interval: age range = [2, 36]
multiple values: eye color = {blue, brown, black}
distribution: {blue 0.5, brown 0.3, black 0.2}
(Billard and Diday, 2006)
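The step from classical to symbolic data is an aggregation. As a rough sketch, the hypothetical function `summarize_team` below turns a list of individual players into the three kinds of symbolic value named above; the player records are invented for illustration.

```python
from collections import Counter
from fractions import Fraction


def summarize_team(players):
    """Aggregate individual records into symbolic values for one team."""
    ages = [p["age"] for p in players]
    colors = [p["eye_color"] for p in players]
    counts = Counter(colors)
    return {
        # interval-valued: smallest and largest observed age
        "age_interval": (min(ages), max(ages)),
        # multi-valued: the set of observed categories
        "eye_colors": set(colors),
        # modal multi-valued: categories with relative frequencies
        "eye_color_distribution": {
            color: Fraction(count, len(colors)) for color, count in counts.items()
        },
    }


# Invented individual-level (classical) data for one team.
players = [
    {"age": 25, "eye_color": "blue"},
    {"age": 31, "eye_color": "blue"},
    {"age": 19, "eye_color": "brown"},
    {"age": 28, "eye_color": "black"},
]
team = summarize_team(players)
```

Each summary keeps a different amount of information: the interval keeps only the extremes, the set keeps which categories occur, and the distribution also keeps how often.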

21 When we are interested in the higher-level units (groups/classes). When the initial data are composed of symbolic data tables.

23 Interval-valued symbolic random variable: Y takes values in an interval, e.g. [7,25].
Multi-valued symbolic random variable: Y takes one or more values, e.g. {2,23,2}.
Modal multi-valued: $Y(u) = \{\eta_k, \pi_k;\ k = 1, 2, \ldots, s_u\}$, e.g. {single, 3/8, married, 5/8}.
Modal interval-valued (histogram): $Y(u) = \{[a_{uk}, b_{uk}), p_{uk};\ k = 1, 2, \ldots, s_u\}$, e.g. {[2,4), 1/7, [4,6), 2/7, [6,8], 4/7}.

24 Classical data: 1. Data Matrix; 2. Subject Proximity (Euclidean distance, Manhattan distance, correlation); 3. Variable Proximity (correlation, covariance, polychoric correlation).
Symbolic data: 1. Data Matrix; 2. Subject Proximity: ?; 3. Variable Proximity: ?
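The question marks ask which proximity measures suit symbolic data. One common candidate for interval-valued subjects is the Hausdorff distance between intervals, summed over variables. The sketch below illustrates that one choice only; it is an assumption made for this example, not necessarily the measure used in igap, and the function names and data are hypothetical.

```python
def hausdorff(a, b):
    """Hausdorff distance between two closed intervals a = (lo, hi), b = (lo, hi)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def interval_distance_matrix(subjects):
    """Subject-by-subject distances: Hausdorff distance summed over variables.

    `subjects` is a list of rows; each row holds one (lo, hi) interval
    per variable.
    """
    n = len(subjects)
    return [
        [
            sum(hausdorff(x, y) for x, y in zip(subjects[i], subjects[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Invented data: three subjects described by two interval-valued variables.
subjects = [
    [(0, 2), (10, 20)],
    [(1, 5), (10, 25)],
    [(0, 2), (10, 20)],
]
proximity = interval_distance_matrix(subjects)
```

When every interval collapses to a single point, this reduces to the Manhattan distance of the classical case, which makes it a natural first candidate.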

29 Color coding for interval-type data

30 The original data are available from the RDA (http://dss.ucar. ...): the lowest and highest temperatures observed over the twelve months of 1988 at sixty meteorological stations in China.

34 This database provides census-like manpower information and economic activities for four levels of hierarchy of townships (989~2).
Level 1: regions; Level 2: 5 areas; Level 3: 82 districts; Level 4: 899 cities.
58 variables (rank transformation, ~899): Income, Tax; Population Indices; Business and Public Services; Industry and Car; Stores, Education, and Expenditure; Agriculture.

35 Level 1: Region (); Level 2: Area (5); Level 3: District (82); Level 4: City (899)
899 Level 4 cities, 58 variables: continuous data, rank-transformed to rank data (~899)
5 Level 2 areas, 58 variables (interval): cities merged into intervals of ranks
Level 1 regions: data covariate
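The preparation described here (rank-transform each variable across cities, then merge the cities of an area into an interval of ranks) can be sketched for a single variable as follows. The function names, the tie rule (ties broken by input order) and the toy data are assumptions for illustration only.

```python
def rank_transform(values):
    """Replace each value by its rank, 1 = smallest; ties broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def merge_to_intervals(ranks, groups):
    """Merge lower-level units into (min rank, max rank) per higher-level group."""
    intervals = {}
    for rank, group in zip(ranks, groups):
        lo, hi = intervals.get(group, (rank, rank))
        intervals[group] = (min(lo, rank), max(hi, rank))
    return intervals


# Invented data: one variable observed for five cities in two areas.
income = [50, 10, 30, 40, 20]
area = ["x", "x", "y", "y", "x"]
ranks = rank_transform(income)
intervals = merge_to_intervals(ranks, area)
```

Working on ranks puts all variables on one common scale, so the interval lengths of different variables can be compared directly.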

36 [Figure: 58 interval variables (range ~899) by 5 concepts (Level 2 areas)]
Variable groups: Income, Tax, Main working pop.; Pop. Indices; Business, Public services; Industry and Car; Stores, Education; Expenditure; Agriculture, Senior Citizen.
Area groups: Greater Tokyo, Greater Osaka, highest pop.; industrial areas (Toyota, Mitsubishi, ...); areas with large city counts; rural areas with large area size and low pop. density.

37 [Figure: 5 areas (concepts) by 58 interval variables, displayed by Min, Mid, Max, and Length; display options: Sufficient, Sediment, Row Condition, Col Condition; annotated groups: length < 949; length < 949 and 949 < mid; 746 < length; 9 < mid <]

38 Demo (igap software) Format of input data Operation environment of igap Displaying modes

39 More on GAP MV for binary data MV for categorical data MV with cartography links MV for modal multi-valued data MV for data with missing values MV for mixed data MV for huge data set MV for time series data
