CIS 4930/6930, Spring 2014: Introduction to Data Science / Data Intensive Computing. University of Florida, CISE Department. Prof. Daisy Zhe Wang
Data Visualization. Topics: Value of Visualization; Data and Image Models; Visualization Design; Exploratory Data Analysis. Slides adapted from Jeffrey Heer at the University of Washington.
What is visualization? "Transformation of the symbolic into the geometric" [McCormick et al. 1987]. "... finding the artificial memory that best supports our natural means of perception." [Bertin 1967]. "The use of computer-generated, interactive, visual representations of data to amplify cognition." [Card, Mackinlay, & Shneiderman 1999] 3
Data 4
Visual Representation 5
Why visualization? Efficient use of attention. "What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it." Herb Simon, as quoted by Hal Varian, Scientific American, September 1995. 6
Why create visualizations? Answer questions or discover them (e.g., what is the Silk Road that runs from Europe to China?); make decisions (e.g., stock market, monitoring systems in hospitals); see data in context (e.g., maps); expand memory (e.g., multiplication); find patterns (e.g., astronomy data, transactions); present an argument or tell a story (e.g., growth of Walmart: http://projects.flowingdata.com/walmart/); inspire (e.g., textbook medicine, genome, DNA). 7
The Value of Visualization. Record information: blueprints, photographs, seismographs, ... Analyze data to support reasoning: develop and assess hypotheses, discover errors in data, expand memory, find patterns. Communicate information to others: share and persuade, collaborate and revise. 8
Record information: Leonardo da Vinci. Map of Imola, created for Cesare Borgia (top); Proportions of man (left). 9
Support Reasoning: Which animal has the most powerful brain? 10
The most powerful brain? 11
Communicate Information: from the New York Times, 1981. 12
Visualization Reference Model 14
Visualization Generation Process 15
Topics: properties of data; properties of images; mapping data to images. 16
Data models vs. conceptual models. Data models are low-level descriptions of the data (mathematical abstractions). Math: sets with operations on them. Example: integers with the + and × operators. Conceptual models are mental constructions; they include semantics and support reasoning. Examples (data vs. conceptual): 1D floats vs. temperature; 3D vectors of floats vs. space. 17
Taxonomy of data types: 1D (sets and sequences); temporal; 2D (maps; spatial); 3D (shapes); nD (relational); trees (hierarchies); networks (graphs). Combinations: e.g., spatial + temporal, spatial + relational. 18
Types of variables. Physical types: characterized by storage format and by machine operations. Examples: bool, short, int32, float, double, string, ... Abstract types: provide descriptions of the data; may be characterized by methods; may be organized into a hierarchy (e.g., an ontology). 19
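The contrast between physical and abstract types can be sketched in a few lines of Python. This is only an illustration: the `Size` enum and its `ranks_above` method are hypothetical names, and the byte counts are the standard sizes reported by Python's `struct` module, not something stated on the slide.

```python
import struct
from enum import Enum

# Physical types: characterized by storage format (size in bytes).
# The "<" prefix asks struct for standard sizes, independent of platform.
PHYSICAL_SIZES = {
    "bool": struct.calcsize("<?"),
    "short": struct.calcsize("<h"),
    "int32": struct.calcsize("<i"),
    "float": struct.calcsize("<f"),
    "double": struct.calcsize("<d"),
}


class Size(Enum):
    """Hypothetical abstract type: describes meaning, not storage."""
    SMALL = 1
    MEDIUM = 2
    LARGE = 3

    def ranks_above(self, other):
        # A method that characterizes the abstract type (its ordering).
        return self.value > other.value


print(PHYSICAL_SIZES)
print(Size.LARGE.ranks_above(Size.SMALL))
```

The same abstract type could be stored as a short, an int32, or a string; the physical choice does not change what the values mean.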
Abstract types of variables. Categorical (data that are counted): nominal, ordinal. Quantitative or numerical (data that are measured): interval, ratio. Why is the type of variable important? The methods used to display, summarize, and analyze data depend on whether the variables are categorical or quantitative. 20
Categorical: Nominal. Nominal variables are named, i.e., classified into one or more qualitative categories that describe the characteristic of interest. There is no ordering of the categories and no measure of distance between values; categories can be listed in any order without affecting the relationship between them. Nominal variables are the simplest type of variable. 21
Categorical: Ordinal. Ordinal variables have an inherent order among the categories: there is an implied ordering of the categories (levels), but the quantitative distance between levels is unknown. Distances between levels may not be the same, and the meaning of a level may differ between individuals. 22
Quantitative/Numerical. Interval variables have constant, equal distances between values, but the zero point is arbitrary. Ratio variables have equal intervals between values and a meaningful zero point, so the numerical relationships between values are meaningful. Quantitative variables may be continuous or discrete. 23
Nominal, Ordinal and Quantitative. N - Nominal (labels): fruits: apples, oranges, ... O - Ordinal (ordered list): quality of meat: Grade A, AA, AAA. Q - Interval (location of zero arbitrary): dates: Jan 19, 2006; location: (LAT 33.98, LONG -118.45). Values cannot be compared directly; only differences (i.e., intervals) may be compared. Q - Ratio (zero fixed): physical measurements: length, mass, temperature (on an absolute scale), counts and amounts. The origin is meaningful. 24
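A small worked example may help show why only differences are meaningful for interval data while ratios require a fixed zero. The sketch below uses the slide's date (Jan 19, 2006) plus a second, made-up date and two made-up Celsius readings; the helper `celsius_to_kelvin` is a hypothetical name.

```python
from datetime import date

# Interval data: dates have an arbitrary zero, so only differences make sense.
d1 = date(2006, 1, 19)   # date from the slide
d2 = date(2006, 2, 19)   # made-up second date for illustration
span_days = (d2 - d1).days

# Celsius is interval-level: the ratio of two readings depends on where zero is.
c_low, c_high = 10.0, 20.0        # made-up readings
naive_ratio = c_high / c_low      # 2.0, but 20 C is not "twice as hot" as 10 C


def celsius_to_kelvin(c):
    """Shift to a scale with a fixed, physically meaningful zero."""
    return c + 273.15


# Kelvin is ratio-level, so this quotient is a meaningful proportion.
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(span_days, naive_ratio, round(kelvin_ratio, 4))
```

The difference `c_high - c_low` is the same 10 degrees on both scales, which is exactly what "only intervals may be compared" means.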
Level of Measurement. Higher-level variables can always be expressed at a lower level, but the reverse is not true: Q > O > N. For example, Body Mass Index (BMI) is typically measured at a quantitative (Q) level, e.g., 23.4. BMI can be collapsed into lower-level ordinal categories such as ≥30: Obese; 25-29.9: Overweight; <25: Normal or underweight; or into nominal categories such as Overweight vs. Not overweight. 25
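The collapse from Q to O to N can be sketched as two small functions. This is an illustrative sketch, not a clinical definition: the function names are hypothetical, the lowest band is labeled "Normal or underweight" as an assumption, and the nominal split treats every BMI of 25 or more as "Overweight".

```python
def bmi_to_ordinal(bmi):
    """Collapse a quantitative BMI value into an ordered category."""
    if bmi >= 30:
        return "Obese"
    if bmi >= 25:
        return "Overweight"
    return "Normal or underweight"  # assumed label for the lowest band


def bmi_to_nominal(bmi):
    """Collapse further into two unordered labels (assumed cut at 25)."""
    return "Overweight" if bmi >= 25 else "Not overweight"


for value in (23.4, 27.0, 31.2):  # 23.4 is from the slide; the rest are made up
    print(value, bmi_to_ordinal(value), bmi_to_nominal(value))
```

Going the other way is impossible: many different BMI values map to the same category, so the original number cannot be recovered from the label.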
Operations on N, O, Q Data Types. N - Nominal (labels): operations =, ≠. O - Ordinal (ordered list): operations =, ≠, <, >. Q - Interval (location of zero arbitrary): operations =, ≠, <, >, -; can measure distances or spans. Q - Ratio (zero fixed): operations =, ≠, <, >, -, ÷; can measure ratios or proportions. 26
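One way to read this slide is as a lookup table from measurement level to permitted operations. The sketch below assumes the operator sets are equality and inequality for N, plus ordering for O, plus subtraction for interval, plus division for ratio; the names `ALLOWED_OPS` and `supports` are hypothetical.

```python
# Operators are written in Python syntax: "!=" for "not equal", "/" for division.
ALLOWED_OPS = {
    "N": {"==", "!="},
    "O": {"==", "!=", "<", ">"},
    "Q-interval": {"==", "!=", "<", ">", "-"},
    "Q-ratio": {"==", "!=", "<", ">", "-", "/"},
}

LEVELS = ["N", "O", "Q-interval", "Q-ratio"]  # lowest to highest


def supports(level, op):
    """Return True if the operation is meaningful at this level."""
    return op in ALLOWED_OPS[level]


# Each level keeps every operation of the level below it and adds more.
nested = all(
    ALLOWED_OPS[lower] < ALLOWED_OPS[higher]
    for lower, higher in zip(LEVELS, LEVELS[1:])
)

print(supports("O", "<"), supports("N", "<"), supports("Q-interval", "/"), nested)
```

A charting tool could use such a table to reject, for example, averaging a nominal field or taking the ratio of two dates.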
From data models to N, O, Q data types. Data model: 32.5, 54.0, -17.3, ... (floats). Conceptual model: temperature (°C). Data type: burned vs. not burned (N); hot, warm, cold (O); continuous range of values (Q). 27
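The same floats can be viewed at each of the three levels. In the sketch below, the three temperatures come from the slide, while every threshold (`BURN_THRESHOLD_C`, `COLD_BELOW_C`, `HOT_ABOVE_C`) is a made-up assumption chosen only to make the example concrete.

```python
temps_c = [32.5, 54.0, -17.3]  # data model: plain floats (values from the slide)

# Hypothetical cut points; the slide does not specify any thresholds.
BURN_THRESHOLD_C = 44.0
COLD_BELOW_C = 10.0
HOT_ABOVE_C = 30.0


def as_nominal(t):
    """N: two unordered labels."""
    return "burned" if t >= BURN_THRESHOLD_C else "not burned"


def as_ordinal(t):
    """O: three ordered labels, cold < warm < hot."""
    if t < COLD_BELOW_C:
        return "cold"
    if t > HOT_ABOVE_C:
        return "hot"
    return "warm"


nominal = [as_nominal(t) for t in temps_c]
ordinal = [as_ordinal(t) for t in temps_c]
quantitative = temps_c  # Q: the continuous values themselves

print(nominal)
print(ordinal)
```

The conceptual model (temperature in °C) is what justifies these readings; the floats alone carry no such meaning.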
Example: sepal and petal lengths and widths for three species of iris [Fisher 1936]. 28
Relational data model. Represent data as a table (relation). Each row (tuple) represents a single record; each record is a fixed-length tuple. Each column (attribute) represents a single variable; each attribute has a name and a data type. A table's schema is the set of names and data types. A database is a collection of tables (relations). 30
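A minimal sketch of these definitions in Python, using a typed named tuple as the record type. The class name `CensusRow`, the column names, and all row values are illustrative assumptions, not real census figures.

```python
from typing import NamedTuple, get_type_hints


class CensusRow(NamedTuple):
    """One record: a fixed-length tuple with named, typed attributes."""
    people: int
    year: int
    age: int
    sex: str
    marital_status: str


# A table (relation) is a collection of rows that share one schema.
# The values below are made up for illustration.
table = [
    CensusRow(100, 1850, 0, "Male", "Single"),
    CensusRow(90, 1850, 0, "Female", "Single"),
]

# The schema is the set of attribute names and their data types.
schema = get_type_hints(CensusRow)

# A database is a collection of named tables.
database = {"census": table}

print(CensusRow._fields)
print(schema)
```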
Relational Algebra [Codd]. Data transformations (SQL): projection (select); selection (where); sorting (order by); aggregation (group by, sum, min, ...); set operations (union, ...); combination (inner join, outer join, ...). 31
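Each transformation listed above can be tried with Python's built-in `sqlite3` module. The sketch below is illustrative only: the `census` and `sex_codes` tables, their columns, and every value in them are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE census (people INTEGER, year INTEGER, sex TEXT)"
)
conn.executemany(
    "INSERT INTO census VALUES (?, ?, ?)",
    [(100, 1850, "Male"), (90, 1850, "Female"),
     (120, 2000, "Male"), (110, 2000, "Female")],  # made-up values
)
conn.execute("CREATE TABLE sex_codes (sex TEXT, code TEXT)")
conn.executemany(
    "INSERT INTO sex_codes VALUES (?, ?)", [("Male", "M"), ("Female", "F")]
)


def query(sql):
    return conn.execute(sql).fetchall()


# Projection: keep only some columns.
projection = query("SELECT year, people FROM census")
# Selection: keep only rows matching a predicate.
selection = query("SELECT * FROM census WHERE sex = 'Female'")
# Sorting.
ordered = query("SELECT people FROM census ORDER BY people DESC")
# Aggregation: one output row per group.
totals = query(
    "SELECT year, SUM(people) FROM census GROUP BY year ORDER BY year"
)
# Set operation: UNION removes duplicate rows.
years = query(
    "SELECT year FROM census WHERE sex = 'Male' "
    "UNION SELECT year FROM census WHERE sex = 'Female' ORDER BY year"
)
# Combination: inner join on a shared attribute.
joined = query(
    "SELECT c.people, s.code FROM census AS c "
    "INNER JOIN sex_codes AS s ON c.sex = s.sex ORDER BY c.people"
)

print(totals)
print(joined)
```

Note the naming clash the slide hints at: relational projection corresponds to SQL's `SELECT` column list, while relational selection corresponds to the `WHERE` clause.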
Statistical data model: variables or measurements; categories or factors or dimensions; observations or cases. 32
Dimensions and Measures. Dimensions: discrete variables describing data, such as dates and categories of values (independent variables). Measures: data values that can be aggregated, i.e., numbers to be analyzed (dependent variables); aggregate as sum, count, average, std. deviation. 33
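In practice, a dimension is what you group by and a measure is what you aggregate within each group. The following sketch assumes a list of made-up records and a hypothetical helper named `aggregate`; it uses the population standard deviation (`statistics.pstdev`), which is one of two reasonable readings of "std. deviation".

```python
from collections import defaultdict
from statistics import mean, pstdev

# Made-up records: "sex" and "year" act as dimensions, "people" as a measure.
records = [
    {"sex": "Male", "year": 1850, "people": 100},
    {"sex": "Female", "year": 1850, "people": 90},
    {"sex": "Male", "year": 2000, "people": 120},
    {"sex": "Female", "year": 2000, "people": 110},
]


def aggregate(rows, dimension, measure):
    """Group rows by one dimension and summarize one measure per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {
        key: {
            "sum": sum(values),
            "count": len(values),
            "average": mean(values),
            "std": pstdev(values),  # population standard deviation
        }
        for key, values in groups.items()
    }


by_sex = aggregate(records, "sex", "people")
by_year = aggregate(records, "year", "people")
print(by_sex)
print(by_year)
```

Swapping the `dimension` argument changes the grouping without touching the measure, which is why tools let users drag dimensions and measures independently.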
Example: U.S. Census Data. People: # of people in group. Year: 1850-2000 (every decade). Age: 0-90+. Sex: Male, Female. Marital Status: Single, Married, Divorced, ... 34
Example: U.S. Census. Columns: People, Year, Age, Sex, Marital Status. 2348 data points. 35
Census: N, O, Q (R/I)? People Count: Q-Ratio. Year: Q-Interval (O). Age: Q-Ratio (O). Sex (M/F): N. Marital Status: N. 36
Census: Measure or Dimension? People Count: Measure. Year: Dimension. Age: Dimension. Sex (M/F): Dimension. Marital Status: Dimension. 37
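The two census classifications can be recorded together as field metadata, which a tool could then use to decide what to group by and what to aggregate. The dictionary layout and the names `CENSUS_FIELDS`, `dimensions`, and `measures` below are hypothetical; the level and role of each field follow the slides.

```python
# Field metadata: measurement level and analytic role for each census column.
CENSUS_FIELDS = {
    "people": {"level": "Q-Ratio", "role": "measure"},
    "year": {"level": "Q-Interval (O)", "role": "dimension"},
    "age": {"level": "Q-Ratio (O)", "role": "dimension"},
    "sex": {"level": "N", "role": "dimension"},
    "marital_status": {"level": "N", "role": "dimension"},
}

# Dimensions are candidates for grouping; measures are candidates for aggregation.
dimensions = [
    name for name, info in CENSUS_FIELDS.items() if info["role"] == "dimension"
]
measures = [
    name for name, info in CENSUS_FIELDS.items() if info["role"] == "measure"
]

print("group by:", dimensions)
print("aggregate:", measures)
```

Year and age are quantitative, yet they act as dimensions here because they take a small number of discrete values that partition the people counts.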