IRDS: Visualization
Charles Sutton
University of Edinburgh

Why visualisation?
- Goal I: Have a data set that I want to understand. This is called exploratory data analysis. Today's lecture.
- Goal II: Want to display data (i.e., for publication). Will save this for a later lecture (if time).
- Find or display relationships in the data.
- This is a prelude to model building (what is most important to model?).
- Major goal is inter-ocular impact.

Visualisations that we won't be interested in
- Graphics provide little additional information.
[Figure: example graphic. Source: Wikipedia]
- For an interesting perspective on this difference, see: Gelman and Unwin. Infovis and statistical graphics: Different goals, different looks (with discussion). Journal of Computational and Graphical Statistics, 2013.

Univariate data
[Slide: a full page of raw numeric values, shown without any summary or plot, ending in "!"]
Summaries
- Mean 27.7, Std Dev 9.5
- Sample mean: $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
- Sample standard deviation: $s_x = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$
- Median and quartiles: Min. ?, 1Q 2?.7, Median 28.?, 3Q 33.6, Max 57.3 (? = digit lost in the source)

Histograms
[Figure: three histograms, with "skew" and "multimodality" labelled]
- These three have the same summary statistics!

Outliers in histograms
[Figure: histogram of the blood pressure data set; x-axis: Blood Pressure, y-axis: Frequency]
- blood pressure = 0?
- UCI ML repository says no missing data (well, for 2[0?] years it did)
[Source: Padhraic Smyth]

Class-Conditional Histograms
[Figure: blood pressure histograms (Frequency) for Positive (diabetes) and Negative]

Alternative: Box plot
[Figure: box plots of Pressure by Diabetes? (neg, pos); annotations: Quartile, Median, Quartile, Extreme data]
- Maybe for only 2 groups, graphs are not necessary. For more visual comparisons, they can be helpful.
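A minimal sketch of the summaries above and of the class-conditional idea, using only the Python standard library. The toy numbers, the `pressure` and `diabetes` lists, and the helper names `summarise` and `split_by_class` are hypothetical; this is not the lecture's data set. The standard deviation is assumed to use the N - 1 denominator.

```python
import statistics

def summarise(values):
    """Return the summary statistics named on the slide for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "std_dev": statistics.stdev(values),  # assumes the N - 1 denominator
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

def split_by_class(values, labels):
    """Group values by class label, as a class-conditional histogram would."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return groups

# Hypothetical toy data.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = summarise(symmetric)
print(summary)

# Hypothetical blood-pressure readings; a reading of 0 is not physically
# plausible, so it probably encodes a missing value.
pressure = [0, 62, 70, 74, 80, 0, 68, 72, 90, 84]
diabetes = ["neg", "neg", "neg", "pos", "pos", "pos", "neg", "neg", "pos", "pos"]
groups = split_by_class(pressure, diabetes)
for label, vals in sorted(groups.items()):
    print(label, summarise(vals))

n_zero = sum(1 for v in pressure if v == 0)
print("suspicious zeros:", n_zero)
```

The summary table alone would not reveal the zeros; counting them (or drawing the histogram) does.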
Effect of bin size
[Figures: histograms of the same data drawn with several different bin sizes]

More misleading histograms
[Figure: histograms. Data: US Post Codes]
[Source: Padhraic Smyth]
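The effect of bin size can be sketched with a small text histogram. The data values and the helper names `histogram` and `show` are hypothetical, chosen so that two clusters are visible with narrow bins and vanish with wide ones.

```python
import math
from collections import Counter

def histogram(values, bin_width, start=0.0):
    """Count values per bin [start + k*w, start + (k+1)*w), keeping empty bins."""
    ks = [math.floor((v - start) / bin_width) for v in values]
    counts = Counter(ks)
    return {start + k * bin_width: counts[k] for k in range(min(ks), max(ks) + 1)}

def show(hist, bin_width):
    """Print one row of '#' marks per bin."""
    for left, count in hist.items():
        print(f"[{left:5.1f}, {left + bin_width:5.1f}) {'#' * count}")

# Hypothetical bimodal toy data: one cluster near 2, another near 8.
data = [1.6, 1.9, 2.0, 2.1, 2.2, 2.4, 2.6, 7.5, 7.8, 7.9, 8.0, 8.2, 8.3, 8.6]

narrow = histogram(data, bin_width=1.0)   # two separate clusters are visible
wide = histogram(data, bin_width=10.0)    # everything lands in a single bin

show(narrow, 1.0)
show(wide, 10.0)
```

Trying several bin widths before settling on one is a cheap guard against both over-smoothed and spiky pictures.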
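Bivariate data

Numerical bivariate summaries
- Data are $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$
- Sample covariance: $s_{xy} = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})(x_i - \bar{x})$
- Sample correlation: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
- where, as before,
  $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$, $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$,
  $s_x = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$, $s_y = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2}$

Dangers of correlation
[Figure: four scatterplots from [Anscombe, 1973]]

Scatterplots
[Figure: scatterplot of x2 against x1]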
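A short sketch of the covariance and correlation formulas, together with one simple illustration of why correlation alone can mislead. The toy data and function names are hypothetical, and the N - 1 denominator is an assumption; it cancels in the correlation either way.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_cov(xs, ys):
    """Sample covariance with the N - 1 denominator (an assumption)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def sample_corr(xs, ys):
    """Sample correlation: covariance divided by both standard deviations."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # y is a straight-line function of x
quadratic = [x * x for x in xs]    # y is fully determined by x, but not linearly

print("linear:   ", sample_corr(xs, linear))
print("quadratic:", sample_corr(xs, quadratic))
```

The quadratic case has correlation zero even though y depends entirely on x, which is the kind of structure a scatterplot shows immediately.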
Colour in Scatterplots
[Figure: scatterplot; x-axis: Token score before attack, y-axis: Token score after attack] [Nelson et al, 2008]
- Each point is a word
- Entire plot: one email
- Axes: spam score
- Colour: whether the token was part of an attack on the spam filter

For our purposes, note:
- Use of colour to add a categorical variable
- Without this colour, we would not have seen these two outliers
- Use of the y = x line to aid the eye

Overplotting
[Figure: three scatterplots of x2 against x1, each with a different number of data points; samples from a bivariate normal]
- Also: notice the axes!
[Figure: scatterplot of 96,000 bank loan applicants]
- Appears: later apps older. Reality: downward slope (more apps, more varian[ce])
[Source: Hand, Mannila, and Smyth]
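One rough way to see how overplotting grows with sample size is to count how many points share a small grid cell, since points in the same cell are drawn on top of each other. This is only a sketch: the cell size, the sample sizes and the names `cell_counts` and `overplot_fraction` are hypothetical choices.

```python
import random
from collections import Counter

def cell_counts(points, cell):
    """Count how many points fall in each cell-by-cell square of the plane."""
    return Counter((int(x // cell), int(y // cell)) for x, y in points)

def overplot_fraction(points, cell):
    """Fraction of points that land in a cell already occupied by another point."""
    counts = cell_counts(points, cell)
    hidden = sum(c - 1 for c in counts.values())
    return hidden / len(points)

rng = random.Random(0)  # fixed seed so the run is repeatable
fractions = {}
for n in (100, 1000, 10000):  # hypothetical sample sizes
    pts = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    fractions[n] = overplot_fraction(pts, cell=0.1)
    print(n, round(fractions[n], 2))
```

The same per-cell counts could be drawn as a 2-D histogram, which shows density that a plain scatterplot hides.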
To fix overplotting, could consider:
- Jittering points
- Subsampling points (i.e., plot only a percentage of them)
- Averaging (if this makes sense)
- Add trend lines (e.g., quantile lines)

Fitted line
[Figure: scatterplot with a fitted line]
- This fit is from loess (local linear regression).

Time Series
- Examples: financial data, network traffic, energy usage, human traffic, building occupancy
- Visualization tricks include: smoothing (running mean, median), repeated multiples
[Figure: time series panels. [Oh et al, 2006], figure from [Xuan and Murphy, 2007]]

Transformations
- Consider powers, logs. Occasionally reciprocals (e.g., rates). Also square root.
[Figure: data Before and After transformation]
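Jittering, subsampling and running-window smoothing are each only a few lines. The sketch below uses the standard library; the function names, the jitter scale, the 10% fraction and the toy values are hypothetical choices, not values from the lecture.

```python
import random
import statistics

def jitter(values, scale, rng):
    """Add small uniform noise so that identical values no longer overlap."""
    return [v + rng.uniform(-scale, scale) for v in values]

def subsample(points, fraction, rng):
    """Keep a random fraction of the points (at least one)."""
    k = max(1, round(fraction * len(points)))
    return rng.sample(points, k)

def running(values, window, stat=statistics.mean):
    """Apply stat over a centred window; the window shrinks at the two ends."""
    half = window // 2
    return [stat(values[max(0, i - half): i + half + 1]) for i in range(len(values))]

rng = random.Random(0)

ages = [30, 30, 30, 31, 31]                   # hypothetical integer-valued data
print(jitter(ages, 0.25, rng))

points = [(i, i % 7) for i in range(50)]      # hypothetical (x, y) pairs
print(subsample(points, 0.1, rng))

series = [1, 2, 100, 4, 5]                    # one spike at position 2
print(running(series, 3))                     # running mean is pulled by the spike
print(running(series, 3, statistics.median))  # running median ignores it
```

The running median is the more robust choice when the series contains isolated spikes; the running mean spreads each spike over its neighbours.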
Example Transformation
[Figure: log-log plot. Source: William Cleveland, Visualizing Data]
- Why log log here? Hint: Imagine a spherical cow

Wait, what if you have categorical data?
Tools here include:
- Colour
- Contingency tables
- Multiple plots (e.g., class-conditional histograms)

Three-Dimensional Data
- Generally hard
- 3-D plots are not usually useful
- Usually better to use colour on a 2-D plot
- Or show multiple 2-D plots, one for each value of the third variable

High-Dimensional Data
Two main options:
- Project the data down to 2-D
  - Many techniques
  - Principal Components Analysis (IAML, MLPR)
  - Multidimensional scaling
  - Modern nonlinear methods: t-SNE, LLE, Isomap, Eigenmaps
  - Problem: Sometimes this will obscure high-d structure and nonlinear structure
- Another option: Scatterplot matrix (see next)
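One possible reading of the spherical-cow hint in the transformation example above: for a sphere, surface area grows as volume to the power 2/3, and any power law becomes a straight line on log-log axes. The sketch below checks this on hypothetical spheres; it is an illustration of the general idea, not a reconstruction of the plotted data, and `fit_line` is a hypothetical helper.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of y against x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

radii = [0.5, 1, 2, 4, 8]                          # hypothetical sphere radii
volume = [4 / 3 * math.pi * r ** 3 for r in radii]
area = [4 * math.pi * r ** 2 for r in radii]

# On log-log axes the power law area = c * volume**(2/3) is a straight line
# whose slope is the exponent.
slope, intercept = fit_line([math.log(v) for v in volume],
                            [math.log(a) for a in area])
print("slope on log-log axes:", slope)
```

The fitted slope comes out at about 2/3, the exponent of the power law, which is why the log-log view is the natural one for this kind of relationship.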
Scatterplot matrix
Colour
Contingency tables
[Figures: scatterplot matrices, colour plots and contingency tables. Annotations on these slides:]
- This is performance data for (very old) CPUs
- Maybe want to use transformed variables up here
- Might be worth understanding points like these
- Important: Scales must be matched
- This row is the variable we want to predict
- This is the prediction according to somebody's model (explains strong relationship)

What are you looking for?
- Anomalies. If something looks weird, figure out why. It could be an error in your data. Learn from your data but do not trust it! (Not completely.)
- Relationships. Hypothesis-based visualization. What relationships do you expect to exist? Can you see them?
- Use visualization to inform models and vice versa, e.g., it can help with feature construction, such as checking whether a relationship is really nonlinear.
- Fancy 3D graphs: meh
- These techniques are also useful for the outputs of learning!

If you really like this stuff
- Tukey, Exploratory Data Analysis
- Bill Cleveland, Visualizing Data
- Edward Tufte, all books
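For the contingency tables mentioned above, a minimal sketch of building one from two categorical variables. The variables `smoker` and `diabetes`, their values, and the helper `contingency_table` are hypothetical.

```python
from collections import Counter

def contingency_table(rows, cols):
    """Cross-tabulate two equally long lists of category labels."""
    counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table

# Hypothetical categorical data, one entry per person.
smoker = ["yes", "no", "no", "yes", "no", "no", "yes", "no"]
diabetes = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg"]

row_levels, col_levels, table = contingency_table(smoker, diabetes)

# Print the table with row and column labels.
print("smoker".ljust(8) + "".join(c.rjust(5) for c in col_levels))
for level, row in zip(row_levels, table):
    print(level.ljust(8) + "".join(str(n).rjust(5) for n in row))
```

Reading across a row shows how one variable is distributed within each level of the other, which is the categorical counterpart of a class-conditional histogram.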