Why visualisation? IRDS: Visualization. Univariate data. Visualisations that we won't be interested in. Graphics provide little additional information


Why visualisation?

IRDS: Visualization
Charles Sutton, University of Edinburgh

- Goal I: Have a data set that I want to understand. This is called exploratory data analysis. Today's lecture.
- Goal II: Want to display data (i.e., for publication). Will save this for a later lecture (if time).
- Find or display relationships in the data.
- This is a prelude to model building (what is most important to model?).
- Major goal is inter-ocular impact.

Visualisations that we won't be interested in

- Graphics provide little additional information. [Image source: Wikipedia]
- For an interesting perspective on this difference, see: Gelman and Unwin. Infovis and statistical graphics: Different goals, different looks (with discussion). Journal of Computational and Graphical Statistics, 2013.

Univariate data

[Slide: a large table of raw numeric values; the individual entries are not recoverable from the extraction.]

Summaries

[Summary table, values as extracted with some digits missing: Mean 27.7, Std Dev 9.5, Min (missing), 1Q 2.7, Median 28., 3Q 33.6, Max 57.3]

- Sample mean: $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
- Sample standard deviation: $s_x = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$
- Median and quartiles

Histograms

[Figure: three histograms, annotated "skew" and "multimodality".]
These three have the same summary statistics!

Outliers in histograms

[Figure: histogram of the blood pressure data set (x-axis: blood pressure; y-axis: frequency), annotated "blood pressure = 0?".]
UCI ML repository says no missing data (well, for 2 years it did). [Source: Padhraic Smyth]

Class-Conditional Histograms

[Figure: histograms of blood pressure for the positive (diabetes) and negative classes; y-axis: frequency.]

Alternative: Box plot

[Figure: box plots of blood pressure by diabetes status (neg, pos), marking the quartiles, the median, and extreme data.]
Maybe for only 2 groups, graphs are not necessary. For more visual comparisons, they can be helpful.
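The point that different histograms can share the same summary statistics can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `summarise` and `match_moments` and the two toy samples are illustrative assumptions, not part of the lecture, and the standard deviation is assumed to divide by N. Both samples are rescaled to the mean (27.7) and standard deviation (9.5) quoted on the slide, yet their medians and quartiles still differ.

```python
import statistics

def summarise(xs):
    """Return mean, 1/N standard deviation, and quartiles of a sample."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(xs),
        "std": statistics.pstdev(xs),  # population form: divides by N
        "min": min(xs), "q1": q1, "median": med, "q3": q3, "max": max(xs),
    }

def match_moments(xs, mean, std):
    """Shift and scale xs so that it has the given mean and 1/N std."""
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return [mean + std * (x - m) / s for x in xs]

# Toy samples (hypothetical): one bimodal, one right-skewed.
bimodal = [1, 1, 2, 2, 2, 8, 8, 8, 9, 9]
skewed = [1, 1, 1, 2, 2, 3, 4, 6, 9, 15]

a = summarise(match_moments(bimodal, 27.7, 9.5))
b = summarise(match_moments(skewed, 27.7, 9.5))

for name, s in (("bimodal", a), ("skewed", b)):
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The mean and standard deviation agree by construction, so only the quartiles (or a histogram) reveal that the two samples have different shapes.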

Effect of bin size

[Figures: histograms of the same data drawn with different bin sizes.]

More misleading histograms

[Figure: histograms with an x-axis scale factor of 10^4.]
Data: US Post Codes [Source: Padhraic Smyth]
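The effect of bin size can be reproduced on a tiny example. This is a sketch under stated assumptions: the `histogram` helper and the twelve data values are made up for illustration and are not the data behind the slides. With three bins the sample looks unimodal; with six bins the gap between the two clusters becomes visible.

```python
def histogram(xs, lo, hi, nbins):
    """Count values of xs in nbins equal-width bins covering [lo, hi]."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for x in xs:
        if lo <= x <= hi:
            # A value exactly equal to hi is placed in the last bin.
            idx = min(int((x - lo) / width), nbins - 1)
            counts[idx] += 1
    return counts

# Hypothetical sample with two clusters, around 20 and around 30.
data = [18, 19, 20, 20, 21, 22, 28, 29, 30, 30, 31, 32]

coarse = histogram(data, 10, 40, nbins=3)
fine = histogram(data, 10, 40, nbins=6)

for label, counts in (("3 bins", coarse), ("6 bins", fine)):
    print(label)
    for c in counts:
        print("  " + "#" * c)
```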

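Bivariate data

Numerical bivariate summaries

Data are $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$.

- Sample covariance: $s_{xy} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})(x_i - \bar{x})$
- Sample correlation: $\rho_{xy} = \frac{s_{xy}}{s_x s_y}$

where, as before,

$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$, $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$, $s_x = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$, $s_y = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^2}$

Dangers of correlation

[Figure: four scatterplots.] [Anscombe, 1973]

Scatterplots

[Figure: scatterplot with axes x1 and x2.]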
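The danger of relying on correlation alone can be verified directly on Anscombe's four data sets. The sketch below uses only the standard library; the function names are illustrative, the covariance is assumed to divide by N, and the data values are the commonly published ones from Anscombe (1973). All four correlations come out at roughly 0.816 even though the scatterplots look completely different.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance with a 1/N normalisation."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Sample correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: correlation(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in correlations.items():
    print(f"data set {name}: correlation = {r:.3f}")
```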

Colour in Scatterplots

[Figure: scatterplot with axes "Token score before attack" and "Token score after attack".] [Nelson et al, 2008]

- Each point is a word
- Entire plot: one email
- Axes: Spam score
- Colour: Whether token was part of an attack on the spam filter

For our purposes, note:

- Use of colour to add a categorical variable
- Without this colour, we would not have seen these two outliers
- Use of the y = x line to aid the eye

Overplotting

[Figure: three scatterplots of x1 against x2 with different numbers of data points (the counts were lost in extraction).]

- Samples from a bivariate normal
- Also: notice the axes!

[Figure: scatterplot of 96,000 bank loan applicants.] Appears: later apps older. Reality: downward slope (more apps, more variance). [Source: Hand, Mannila, and Smyth]

Fitted line

[Figure: scatterplot with a fitted line.] This fit is from loess (local linear regression).

To fix overplotting, could consider:

- Jittering points
- Subsampling points (i.e., plot only a percentage of them)
- Averaging (if this makes sense)
- Add trend lines (e.g., quantile lines)

Time Series

Examples:

- Financial data
- Network traffic
- Energy usage
- Human traffic
- Building occupancy

Visualization tricks include:

- Smoothing (running mean, median)
- Repeated multiples
- Transformations

[Figure: time series panels.] [Oh et al, 2006], figure from [Xuan and Murphy, 2007]

Transformations

Consider powers, logs. Occasionally reciprocals (e.g., rates). Also square root.

[Figure: data before and after transformation.]
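Three of the tricks listed above (jittering, subsampling, and running-mean or running-median smoothing) are simple enough to sketch with the standard library. The function names, default seeds, and toy inputs are assumptions made for illustration; the lecture does not prescribe a particular implementation.

```python
import random
import statistics

def jitter(values, amount, seed=0):
    """Add uniform noise in [-amount, amount] so coincident points separate."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

def subsample(points, fraction, seed=0):
    """Keep a random fraction of the points (at least one) for plotting."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(points)))
    return rng.sample(points, k)

def running(values, window, stat=statistics.fmean):
    """Apply stat (mean by default) to each full sliding window of the series."""
    return [stat(values[i:i + window]) for i in range(len(values) - window + 1)]

series = [1, 100, 2, 3, 4]  # hypothetical series with one spike
print(running(series, 3))                     # running mean is pulled up by the spike
print(running(series, 3, statistics.median))  # running median is robust to it
print(len(subsample(list(range(1000)), 0.1)), "of 1000 points kept")
print(jitter([1, 1, 1], 0.25))
```

The running median ignores the single spike while the running mean smears it across neighbouring windows, which is one reason to try both when smoothing a noisy series.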

Example Transformation

[Figure: log-log plot.] Why log-log here? Hint: Imagine a spherical cow. [Source: William Cleveland, Visualizing Data]

Wait, what if you have categorical data?

Tools here include:

- Colour
- Contingency tables
- Multiple plots (e.g., class-conditional histograms)

Three-Dimensional Data

- Generally hard
- 3-D plots are not usually useful
- Usually better to use colour on a 2-D plot
- Or show multiple 2-D plots for each value of the third variable

High-Dimensional Data

Two main options:

- Project the data down to 2-D. Many techniques:
  - Principal Components Analysis (IAML, MLPR)
  - Multidimensional scaling
  - Modern nonlinear methods: t-SNE, LLE, Isomap, Eigenmaps
  - Problem: Sometimes this will obscure high-D structure and nonlinear structure
- Another option: Scatterplot matrix (see next)
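Returning to the log-log question in the transformation example: one possible reading of the "spherical cow" hint (an assumption, since the slide does not spell it out) is that quantities related by a power law, such as the surface area and volume of a sphere, fall on a straight line after taking logs of both axes, with the exponent as the slope. The sketch below fits that slope by least squares; `loglog_slope` and the chosen volumes are illustrative.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

# For a sphere, area = (36 * pi)^(1/3) * volume^(2/3), a power law in the volume.
volumes = [1, 10, 100, 1000, 10000]
areas = [(36 * math.pi) ** (1 / 3) * v ** (2 / 3) for v in volumes]

slope = loglog_slope(volumes, areas)
print(f"slope on log-log axes: {slope:.4f}")  # the power-law exponent, 2/3
```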

Scatterplot matrix

[Figure: scatterplot matrix.] This is performance data for (very old) CPUs.

- Maybe want to use transformed variables up here
- Might be worth understanding points like these
- This row is the variable we want to predict
- This is the prediction according to somebody's model (explains strong relationship)

Colour / Contingency tables

[Figures: plots using colour, and contingency tables.]

- Important: Scales must be matched

What are you looking for?

- Anomalies. If something looks weird, figure out why. It could be an error in your data. Learn from your data but do not trust it! (Not completely.)
- Relationships. Hypothesis-based visualization. What relationships do you expect to exist? Can you see them?
- Use visualization to inform models and vice versa, e.g., it can help with feature construction, such as whether a relationship is really nonlinear.
- Fancy 3D graphs: meh.
- These techniques are also useful for the outputs of learning!

If you really like this stuff

- Tukey, Exploratory Data Analysis
- Bill Cleveland, Visualizing Data
- Edward Tufte, all books
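For the contingency tables mentioned above as a tool for categorical data, a minimal sketch with the standard library is shown here. The `contingency_table` helper and the eight records (diabetes status against a blood pressure band) are invented for illustration and are not taken from the lecture's data set.

```python
from collections import Counter

def contingency_table(rows, cols):
    """Cross-tabulate two categorical variables of equal length."""
    counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table

# Hypothetical records: one entry per patient.
diabetes = ["pos", "neg", "neg", "pos", "neg", "pos", "neg", "neg"]
pressure = ["high", "low", "low", "high", "high", "low", "low", "high"]

row_levels, col_levels, table = contingency_table(diabetes, pressure)

print("      " + "  ".join(f"{c:>5}" for c in col_levels))
for level, counts in zip(row_levels, table):
    print(f"{level:>5} " + "  ".join(f"{n:>5}" for n in counts))
```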