Exploratory/Visual Data Analysis
- Gerald French
1 Exploratory/Visual Data Analysis. Intelligent Data Analysis, 9/14/2018. Budapest University of Technology and Economics (Budapesti Műszaki és Gazdaságtudományi Egyetem), Department of Measurement and Information Systems (Méréstechnika és Információs Rendszerek Tanszék), Fault Tolerant Systems Research Group
2 Data mining building blocks: clustering, classification, association rules, regression
3 Clustering
4 Association rules
5 Classification
6 Regression
7 The systematic way: exploratory data analysis. Look and see. Diagram labels: Data, Data analysis, Model, New knowledge.
8 Dr. Snow and the 1854 London cholera epidemic. "About half of our sensory neurons are dedicated to vision, endowing us with a remarkable pattern-recognition ability." (Prof. Alfred Inselberg)
9 Exploratory data analysis (EDA): a summary of the main characteristics of a data set
o Identification of outliers, trends and other patterns
o Often with visual methods
o A statistical model may or may not be used
The aim is to see what the data can tell us
o beyond formal modeling or hypothesis testing
o leading to hypotheses, new data collection and experiments
10 Exploratory data analysis (EDA): an approach to analyzing data sets
o Summarizes their main characteristics in easy-to-understand, often visual, graphs, without using a statistical model or having formulated a hypothesis
Prepares confirmatory data analysis
o Checking a statistical model or a hypothesis
Dynamic visualization capabilities
o Identification of outliers, trends and patterns
Reduction of the sensitivity of inferences to errors
o Robust statistics: low sensitivity to outliers
o Nonparametric statistics: no a priori assumptions
11 Approach: visual exploratory analytics. Resources: sensors, processors. The process is based on interactivity:
1. Graphical presentation of the data (multiple diagrams)
2. Visual evaluation, exploiting the human ability to get an overview
3. Visual selection and manipulation (multiple diagrams)
4. Interpretation, correlation with other models (such as the architecture), evaluation
12 Visualisation in everyday life: trend analysis and forecast, time series analysis, correlation analysis, analysis of spatial data
13 Additional knowledge, for example:
o People buying coffee often buy milk
o There is a significant difference in salaries depending on gender
o The memory consumption of a piece of software grows exponentially with the number of requests in the queue
o The population follows an N(100, 15) distribution
o BME students fall into 3 main groups (according to their grades)
14 Data analysis workflow. Diagram labels: Data, Cleaning, Data analysis (Exploratory, Confirmatory), Model, Additional knowledge.
15 Data analysis
Exploratory analysis
o Goal: formulate hypotheses
o Know the data/domain
o Highly ad hoc
o Mainly descriptive statistics + data mining + visualization
Confirmatory analysis
o Goal: test hypotheses
o Validate
o Mainly statistical tests + statistical inference
16 Data analysis, e.g. distribution analysis
o Exploratory: the hypothesis that variable x follows a normal distribution
o Confirmatory: variable x follows an N(12, 4) distribution
17 Data analysis, e.g. linear regression
o Exploratory: the hypothesis that variables x and y are linearly correlated
o Confirmatory: Y = 3.2 x ... 1.2, R^2 = ...
18 Data analysis workflow (same diagram as on slide 14)
19 Data cleaning
Reasons for inconsistencies
o Input/measurement errors, wrong join, partial observation, fraud, ...
Outliers
o Not the same as gross errors
o Identifying them can itself be an analysis goal
o Hard to detect in multiple dimensions
o Visualization in a low number of dimensions may help
20 Boxplot: five-number summary (Min., Q1, Median, Q3, Max.). Figure labels: Q1 - 1.5 IQR; value ranges Extr. small, Small, Normal, Large, Extr. large.
21 Example: robust and non-parametric measures
Sample set
o 1000 points, uniformly distributed, U(1, 5): response times of 3 ms ± 2 ms, so mean = median = 3 ms
o Plus 1 sample of 20 s
Response time median (robust)
o New median: sort(resp. times)[501] = 3.02 ms
Response time mean (non-robust)
o New mean: (2 * 10^4 + 3 * 10^3) / 1001 ≈ 23 ms!
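The effect can be reproduced with a short sketch. It assumes Python's standard library and a fixed random seed (both are choices made here, not part of the slide); the 1000 uniform samples and the single 20 s outlier follow the example above.

```python
import random
import statistics

random.seed(0)  # assumption: fixed seed, only to make the sketch reproducible

# 1000 response times in milliseconds, uniform on [1, 5] as in the example
times_ms = [random.uniform(1.0, 5.0) for _ in range(1000)]
mean_before = statistics.mean(times_ms)
median_before = statistics.median(times_ms)

# Add one gross outlier: a single 20 s (= 20 000 ms) response
with_outlier = times_ms + [20_000.0]
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} ms -> {mean_after:.2f} ms")
print(f"median: {median_before:.2f} ms -> {median_after:.2f} ms")
```

The mean jumps by roughly 20 ms while the median barely moves, which is what "robust" means here.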
22 Parallel coordinates: multi-factor (many-factor) analysis. You can do much, much more:
o Redundancy reduction
o Correlation analysis
o Clustering
o Data mining
o Approximation
o Optimization
Source: Inselberg, A.: Parallel Coordinates: Visual Multidimensional Geometry and Its Applications, Springer
23 Example: host CPU usage. Figure: host CPU usage against the ratio of bad vCPU ready. Annotations: different failure domains appear as outlier clusters; enough slack; better workload management; scheduling helps.
24 Example: strange observation. Computing power use = CPU use ratio × CPU clock rate (const.), so the relation should be purely proportional. Correlation coefficient:
25 Repetition: basic statistics
o Min: smallest value
o Max: greatest value
o Mean: $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$
o Variance: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
Source: Think Stats: Probability and Statistics for Programmers
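As a minimal sketch, the four quantities can be computed directly from their definitions. The function name `describe` is hypothetical; the sample is the one used on the later percentile slides, and the variance uses the divisor n, as in the formula above.

```python
def describe(values):
    """Return min, max, mean and population variance (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return min(values), max(values), mean, variance

sample = [3, 4, 4, 5, 5, 6, 10, 20]
lo, hi, mean, var = describe(sample)
print(lo, hi, mean, var)  # 3 20 7.125 27.609375
```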
26 Reviewing the calculations: avoiding erroneous assumptions
o Intuition!
o Overly primitive model
o Single outlier! Lack of robustness
For all cases:
o Means: $M_x = 9$, $M_y \approx 7.5$
o Variances: $\sigma_x^2 = 11$, $\sigma_y^2 \approx 4.12$
o Correlation: $C_{x,y} \approx 0.816$
o Regression: y ~ x
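The summary values on this slide coincide with those of Anscombe's quartet, so the sketch below assumes that this is the data behind the figure (the slide text itself does not name it). It recomputes the statistics for sets I and II of the quartet; note that the variance of 11 is obtained with the sample divisor n - 1.

```python
import math

def mean(v):
    return sum(v) / len(v)

def sample_variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)  # divisor n - 1

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Anscombe's quartet, sets I and II (same x, different y)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in (("I", y1), ("II", y2)):
    print(name, round(mean(x), 2), round(mean(y), 2),
          round(sample_variance(x), 2), round(sample_variance(y), 2),
          round(pearson(x, y), 3))
```

Both sets print nearly identical numbers although their scatter plots look very different, which is the reason for plotting the data before trusting the summary.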
27 Example: Visualisation of State Spaces
28 Example: State Space of the CAN Bus
29 Example: system model and performance model
o Modelling: system model, qualitative model, performance model
o Design of the analysis: expectations, parameters to measure, design of the experiments
o Data collection: measurements, benchmarks, simulation
o Analysis: exploratory analysis, hypothesis testing
30 Example: performance analysis. Figure annotations: outliers/background operations; deviations?; linearity; unpredictable behavior.
31 Reminder: tabular representation
o Rows of the table = model elements
o Columns of the table = properties
Name            Type       Size (kb)  Last modified
Documents       directory
Contracts.pdf   file
Pictures        directory
Logo.png        file
Groundplot.jpg  file
Data analysis languages (e.g. R, Python): data frame
o One row: one measurement/observation
o Columns have their own types
32 Numerical and categorical variables
Numerical
o Arithmetic operations are meaningful (average, sum, increment, decrement, ...)
o Examples: age, average temperature
Categorical
o No operations between the values
o Examples: phone number, sex
33 Numerical variables
Continuous
o Measured in a specific range with a specific precision
o e.g. temperature of the server room
Discrete (integer)
o Counted: a finite number of values in a specific range
o e.g. number of disks
34 Categorical variables
Ordered (ordinal)
o e.g. excellent, good, fair, poor
o The value range is ordered
o Fully ordered?
Unordered (nominal)
o Types
o City names
35 Single variable
o Numerical, e.g. test results: [28, 28, 30, ...]
o Categorical, e.g. training groups: [G01, G02, G03, ...]
36 Column charts / bar charts
o Input variable: course codes
o Question: how many students have enrolled? Are there popular time slots or trainers?
o Height of a bar: frequency of the given value (absolute frequency!)
o Design decision: how to split the value range, e.g. Tuesday-Thursday-Friday?
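The bar heights are plain value counts. A minimal sketch, assuming a hypothetical list of enrolment records (the course codes and counts below are illustrative, not data from the slides):

```python
from collections import Counter

# Hypothetical enrolment records: one course code per student
enrolments = ["G01", "G02", "G01", "G03", "G01", "G02"]

frequency = Counter(enrolments)  # absolute frequency of each value
for code, count in sorted(frequency.items()):
    print(f"{code} {'#' * count} ({count})")
```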
37 Single variable
o Numerical, e.g. test results: [28, 28, 30, ...]
o Categorical, e.g. training groups: [G01, G02, G03, ...]
38 Histogram
o Input variable: test results
o Question: what results were achieved?
o Height of a bar: frequency of the given interval of values (absolute frequency!)
o Design decision: choosing the length of the intervals, e.g. 1-point resolution vs. 0.5-point resolution?
39 Histogram
o Input variable: test results
o Question: what results were achieved?
o Many almost passed the entry test.
o Most of those who passed the entry test also passed the whole test.
o Some did not appear at all.
o The average/median was at 20 points.
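The design decision about interval length can be illustrated with a small binning sketch. The results list is hypothetical, and the helper `histogram` is a name introduced here; it counts values per half-open interval of the chosen width.

```python
import math
from collections import Counter

def histogram(values, width):
    """Count values per half-open interval [k*width, (k+1)*width)."""
    bins = Counter(math.floor(v / width) * width for v in values)
    return dict(sorted(bins.items()))

# Hypothetical test results in points
results = [0, 0, 12.5, 13, 19.5, 20, 20, 20.5, 21, 28]

coarse = histogram(results, 1)    # 1-point resolution
fine = histogram(results, 0.5)    # 0.5-point resolution
print(coarse)
print(fine)
```

With 1-point bins the results 20, 20 and 20.5 fall into one bar; with 0.5-point bins they are split, so the same data can look more or less peaked depending on the resolution.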
40 Simple statistical description: where is the middle of the values?
41 Simple statistical description: how far are the values scattered?
42 Simple statistical description: are there outliers?
43 Box plots
o Input variable: test results
o Question: what results were achieved, approximately?
o A kind of abstraction: just take some intervals; the exact values are not that important
44 Description of (continuous) observations
Description of the middle
1. Average: the arithmetic mean
2. Median: the element separating the upper half from the lower half (ordered data sets!)
3. Mode: the most frequent element
o Example: {3, 4, 4, 5, 5, 6, 10, 20}: mean ≈ 7.1, median 5, mode 4 and 5 (often given as 4.5)
Description of the spread
o Percentiles (frequency for categorical types)
45 Describing (continuous) observations. If the elements of a data set are ordered, the middle element is the median of the data set. If there is no middle element (an even number of elements), the median is the average of the two middle elements. The mode is the most frequent element of the data set. If there is no unique most frequent element, the data set has multiple modes.
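The three measures of the middle can be checked on the example set with Python's standard `statistics` module; this is only a sketch of the definitions above, not tooling used in the course.

```python
import statistics

data = [3, 4, 4, 5, 5, 6, 10, 20]

mean = statistics.mean(data)        # arithmetic mean
median = statistics.median(data)    # even count: average of the two middle elements
modes = statistics.multimode(data)  # every most frequent value

print(mean, median, modes)  # 7.125 5.0 [4, 5]
```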
46 Percentiles
Percentile
o n% of the values are smaller than the nth percentile
o Example {3, 4, 4, 5, 5, 6, 10, 20}: 50th percentile: ...; ... percentile: ...; ... percentile: 10
Quartiles
o Q1: 25th percentile
o Q2: median
o Q3: 75th percentile
Frequency: n% of the values lie in the given category (or categories)
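Several conventions exist for computing percentiles of a small sample, and they can give different values. The sketch below assumes the nearest-rank convention (the slide does not say which one it uses); the function name `percentile` is hypothetical.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

data = [3, 4, 4, 5, 5, 6, 10, 20]
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
print(q1, q2, q3)  # 4 5 6
```

An interpolating convention, such as the default of `statistics.quantiles`, would return a different third quartile for the same data, so reported percentiles should name the method used.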
47 Example: representation of percentiles
48 Box and whisker plot
49 Box and whisker plot. Figure labels, top to bottom: limit of the upper outliers, Q3, median, Q1, limit of the lower outliers. How to do it in Excel?
50 Box and whisker plot. Task2 produced (on average) better results than Task1. 50% of the results of Task1 lie between 4.5 and
51 Box and whisker plot. How were the results per training group? Very good entry tests in G06, G11 and G17. G08?
52 Box and whisker plot. Abstraction: with a box plot we can miss important information
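53 Boxplot (box and whisker plot). Interquartile range: IQR = Q3 - Q1. For normally distributed data, the fences at Q1 - 1.5 IQR and Q3 + 1.5 IQR lie at about ±2.7σ, i.e. nearly 3σ.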
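Both statements can be checked with a short sketch. It assumes Python's `statistics` module and its default ("exclusive") quartile method; the helper `outliers` is a hypothetical name, and the sample is the one from the percentile slides.

```python
import statistics

# Position of the upper fence for a standard normal distribution
normal = statistics.NormalDist(mu=0.0, sigma=1.0)
q1, q3 = normal.inv_cdf(0.25), normal.inv_cdf(0.75)
upper_fence_sigma = q3 + 1.5 * (q3 - q1)  # in units of sigma

def outliers(values):
    """Values outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]."""
    lo_q, _, hi_q = statistics.quantiles(values, n=4)  # default 'exclusive' method
    iqr = hi_q - lo_q
    low, high = lo_q - 1.5 * iqr, hi_q + 1.5 * iqr
    return [v for v in values if v < low or v > high]

print(round(upper_fence_sigma, 2))            # 2.7
print(outliers([3, 4, 4, 5, 5, 6, 10, 20]))   # [20]
```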
54 Boxplots vs. qualitative domains. Figure labels: Min., Q1 - 1.5 IQR, Median, Max. Both the qualitative and the quantitative scale are divided into Extr. small, Small, Normal, Large, Extr. large.
55 Median instead of mean. Why?
o Data set: 1000 points in (1, 5) with uniform distribution: response times of 3 ms ± 2 ms, mean = median = 3 ms
o Plus 1 element: 20 s
o Median (robust): new median = sort(resp. times)[501], still about 3 ms
o Mean (not robust): new mean = (2 * 10^4 + 3 * 10^3) / 1001 ≈ 23 ms!
56 Relation between two variables. Each variable is numerical or categorical, which gives three cases: 2 numerical; 1 numerical and 1 categorical; 2 categorical.
57 Numerical, per category
58 Relation between two variables (overview repeated): 2 numerical; 1 numerical and 1 categorical; 2 categorical
59 Scatter plot
o Input variable: results of the two main test tasks
o Question: is there any correlation?
o Pairs of results are displayed. Where one of them is missing, the pair cannot be shown
60 Scatter plot
o Input variable: results of the two main test tasks
o Question: is there any correlation?
o A good result for Task1 did not necessarily mean a good result for Task2.
o How should overlapping points be handled?
61 Overplotting
62 Overplotting, solution 1: jitter
63 Overplotting, solution 2: transparency
64 Overplotting, solution 3: size
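Of the three remedies, jitter is the easiest to sketch without a plotting library: each point is shifted by a small random amount so that coinciding points become distinguishable. The scores, the jitter width and the seed below are assumptions for illustration only.

```python
import random

def jitter(points, width, rng):
    """Shift each (x, y) point by uniform noise in [-width, width] per axis."""
    return [(x + rng.uniform(-width, width), y + rng.uniform(-width, width))
            for x, y in points]

# Hypothetical integer scores for two tasks; several pairs coincide
scores = [(5, 4), (5, 4), (5, 4), (3, 2), (3, 2)]
rng = random.Random(42)  # assumption: seeded only for reproducibility
shown = jitter(scores, width=0.2, rng=rng)

print(len(set(scores)), "distinct positions before,", len(set(shown)), "after")
```

The width should stay well below the spacing of the real values, otherwise the jitter itself suggests structure that is not in the data.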
65 Relation between two variables (overview repeated): 2 numerical; 1 numerical and 1 categorical; 2 categorical
66 Mosaic plot: relation between 2 or more categorical variables. Example: the Guardian's list of "1000 songs to hear before you die"
67 Multiple variables
68 More than two variables. Changing the properties of the graphical elements:
o Color
o Size
o Texture
o Place (in a non-trivial way; in tree maps, however, the place has a direct meaning)
E.g. bubble chart, heatmap, treemap
69 3D plot
70 Bubble chart: average age by regions
71 Scatterplot matrix
72 Heatmap: Operations Statistics We mostly write simple texts. In Start we only start, but we never return
73 Pairwise correlation of multiple values
o Correlation (see Probability Theory): strength and direction of the linear dependency between two variables
o Figure annotations: variable names; independent values; strong correlation; blue: positive, red: inverse; darker colours: stronger correlation; above the diagonal: scatterplot matrix
o Drawn with the corrgram package of the statistics software R
o Goal: filtering out the related variables, identification of the outliers
o Question: which variables are important for the prediction of the load?
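The numbers behind such a correlogram form a pairwise correlation matrix. The slide uses R's corrgram package; the sketch below is only a plain-Python illustration of the same idea, and the column names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements; names and values are illustrative only
columns = {
    "load":     [1.0, 2.0, 3.0, 4.0, 5.0],
    "requests": [10.0, 21.0, 29.0, 41.0, 50.0],  # grows with load
    "idle":     [90.0, 82.0, 69.0, 61.0, 50.0],  # shrinks with load
}

matrix = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}
for name, row in matrix.items():
    print(name.ljust(9), " ".join(f"{r:+.2f}" for r in row.values()))
```

Values near +1 or -1 correspond to the dark cells of the figure; the sign corresponds to its colour.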
74 Tree map: e.g. file system
75 Parallel coordinates
o Multi-dimensional visualization
o Compact, scalable
o Axis order? The records [2, 4, 4, 5, 6], [3, 6, 6, 4, 5], [1, 2, 2, 3, 3] in axis order A B C D E become [2, 5, 4, 6, 4], [3, 4, 6, 5, 6], [1, 3, 2, 3, 2] in axis order A D C E B
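Changing the axis order is just a permutation of each record's coordinates, as the following sketch shows with the records from the slide. The helper name `reorder` is hypothetical.

```python
AXES = ["A", "B", "C", "D", "E"]
records = [
    [2, 4, 4, 5, 6],
    [3, 6, 6, 4, 5],
    [1, 2, 2, 3, 3],
]

def reorder(rows, axes, new_order):
    """Return rows with their coordinates permuted into new_order."""
    index = [axes.index(name) for name in new_order]
    return [[row[i] for i in index] for row in rows]

reordered = reorder(records, AXES, ["A", "D", "C", "E", "B"])
print(reordered)  # [[2, 5, 4, 6, 4], [3, 4, 6, 5, 6], [1, 3, 2, 3, 2]]
```

The data are unchanged, but only neighbouring axes show their relation directly in the plot, so different orders reveal different patterns.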
76 Parallel coordinates: analysis of the test cases. 1 test case: 1 broken line. The variables appear on the x-axis
77 Parallel coordinates: analysis of the test cases. Timeout? The test cases that detected an error did not even get to the actual computation. Run time and memory usage seem to be positively related (if the test is successful)
78 Parallel coordinates: the alternatives
79 Radar Chart: An Extension of Parallel Coord
80 References
o Tukey, John W. "Exploratory data analysis." (1977): 2.
o Inselberg, Alfred, and Bernard Dimsdale. "Parallel coordinates for visualizing multi-dimensional geometry." Computer Graphics. Springer, Tokyo.
o Kocsis Imre. Vizuális analízis (Visual analysis). In: Antal Péter, Antos András, Horváth Gábor, Hullám Gábor, Kocsis Imre, Marx Péter, Millinghoffer András, Pataricza András, Salánki Ágnes. Intelligens adatelemzés (Intelligent data analysis). 141 p. Budapest: Typotex Kiadó.
Adatfeldolgozás és elemzés (Data processing and analysis)
Adatfeldolgozás és elemzés (Data processing and analysis) Intelligens Elosztott Rendszerek http://www.mit.bme.hu/oktatas/targyak/vimiac02 Budapest University of Technology and Economics Fault Tolerant
More informationPart I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures
Part I, Chapters 4 & 5 Data Tables and Data Analysis Statistics and Figures Descriptive Statistics 1 Are data points clumped? (order variable / exp. variable) Concentrated around one value? Concentrated
More informationECLT 5810 Data Preprocessing. Prof. Wai Lam
ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate
More informationData can be in the form of numbers, words, measurements, observations or even just descriptions of things.
+ What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and
More informationAcquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.
Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting
More informationTable of Contents (As covered from textbook)
Table of Contents (As covered from textbook) Ch 1 Data and Decisions Ch 2 Displaying and Describing Categorical Data Ch 3 Displaying and Describing Quantitative Data Ch 4 Correlation and Linear Regression
More informationSTA 570 Spring Lecture 5 Tuesday, Feb 1
STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row
More informationData Mining and Analytics. Introduction
Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data
More informationSTA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures
STA 2023 Module 3 Descriptive Measures Learning Objectives Upon completing this module, you should be able to: 1. Explain the purpose of a measure of center. 2. Obtain and interpret the mean, median, and
More informationData Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha
Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking
More informationChapter 2 Describing, Exploring, and Comparing Data
Slide 1 Chapter 2 Describing, Exploring, and Comparing Data Slide 2 2-1 Overview 2-2 Frequency Distributions 2-3 Visualizing Data 2-4 Measures of Center 2-5 Measures of Variation 2-6 Measures of Relative
More informationSTA Module 2B Organizing Data and Comparing Distributions (Part II)
STA 2023 Module 2B Organizing Data and Comparing Distributions (Part II) Learning Objectives Upon completing this module, you should be able to 1 Explain the purpose of a measure of center 2 Obtain and
More informationSTA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II)
STA 2023 Module 2B Organizing Data and Comparing Distributions (Part II) Learning Objectives Upon completing this module, you should be able to 1 Explain the purpose of a measure of center 2 Obtain and
More informationAverages and Variation
Averages and Variation 3 Copyright Cengage Learning. All rights reserved. 3.1-1 Section 3.1 Measures of Central Tendency: Mode, Median, and Mean Copyright Cengage Learning. All rights reserved. 3.1-2 Focus
More informationData Statistics Population. Census Sample Correlation... Statistical & Practical Significance. Qualitative Data Discrete Data Continuous Data
Data Statistics Population Census Sample Correlation... Voluntary Response Sample Statistical & Practical Significance Quantitative Data Qualitative Data Discrete Data Continuous Data Fewer vs Less Ratio
More informationLAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA
LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA This lab will assist you in learning how to summarize and display categorical and quantitative data in StatCrunch. In particular, you will learn how to
More informationPrepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.
Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good
More informationThings you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs.
1 2 Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs. 2. How to construct (in your head!) and interpret confidence intervals.
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu September 10, 2013 2: Data Pre-Processing Getting to know your data Basic Statistical Descriptions of Data
More informationChapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd
Chapter 3: Data Description - Part 3 Read: Sections 1 through 5 pp 92-149 Work the following text examples: Section 3.2, 3-1 through 3-17 Section 3.3, 3-22 through 3.28, 3-42 through 3.82 Section 3.4,
More informationIAT 355 Visual Analytics. Data and Statistical Models. Lyn Bartram
IAT 355 Visual Analytics Data and Statistical Models Lyn Bartram Exploring data Example: US Census People # of people in group Year # 1850 2000 (every decade) Age # 0 90+ Sex (Gender) # Male, female Marital
More informationM7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes.
M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes. Population: Census: Biased: Sample: The entire group of objects or individuals considered
More informationLearner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display
CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &
More informationVocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.
5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table
More informationChapter 1. Looking at Data-Distribution
Chapter 1. Looking at Data-Distribution Statistics is the scientific discipline that provides methods to draw right conclusions: 1)Collecting the data 2)Describing the data 3)Drawing the conclusions Raw
More informationChapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data
Chapter 2 Descriptive Statistics: Organizing, Displaying and Summarizing Data Objectives Student should be able to Organize data Tabulate data into frequency/relative frequency tables Display data graphically
More informationData Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining
Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.
More informationData Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining
Data Mining: Exploring Data Lecture Notes for Data Exploration Chapter Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 What is data exploration?
More informationData Analyst Nanodegree Syllabus
Data Analyst Nanodegree Syllabus Discover Insights from Data with Python, R, SQL, and Tableau Before You Start Prerequisites : In order to succeed in this program, we recommend having experience working
More informationChapter 3 - Displaying and Summarizing Quantitative Data
Chapter 3 - Displaying and Summarizing Quantitative Data 3.1 Graphs for Quantitative Data (LABEL GRAPHS) August 25, 2014 Histogram (p. 44) - Graph that uses bars to represent different frequencies or relative
More informationData Mining: Exploring Data. Lecture Notes for Chapter 3
Data Mining: Exploring Data Lecture Notes for Chapter 3 1 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include
More informationUnivariate Statistics Summary
Further Maths Univariate Statistics Summary Types of Data Data can be classified as categorical or numerical. Categorical data are observations or records that are arranged according to category. For example:
More informationData Analyst Nanodegree Syllabus
Data Analyst Nanodegree Syllabus Discover Insights from Data with Python, R, SQL, and Tableau Before You Start Prerequisites : In order to succeed in this program, we recommend having experience working
More informationSummarising Data. Mark Lunt 09/10/2018. Arthritis Research UK Epidemiology Unit University of Manchester
Summarising Data Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 09/10/2018 Summarising Data Today we will consider Different types of data Appropriate ways to summarise these
More informationChapter 6: DESCRIPTIVE STATISTICS
Chapter 6: DESCRIPTIVE STATISTICS Random Sampling Numerical Summaries Stem-n-Leaf plots Histograms, and Box plots Time Sequence Plots Normal Probability Plots Sections 6-1 to 6-5, and 6-7 Random Sampling
More informationName Date Types of Graphs and Creating Graphs Notes
Name Date Types of Graphs and Creating Graphs Notes Graphs are helpful visual representations of data. Different graphs display data in different ways. Some graphs show individual data, but many do not.
More informationCHAPTER 2: SAMPLING AND DATA
CHAPTER 2: SAMPLING AND DATA This presentation is based on material and graphs from Open Stax and is copyrighted by Open Stax and Georgia Highlands College. OUTLINE 2.1 Stem-and-Leaf Graphs (Stemplots),
More informationYour Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression
Your Name: Section: 36-201 INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Objectives: 1. To learn how to interpret scatterplots. Specifically you will investigate, using
More informationMinitab 17 commands Prepared by Jeffrey S. Simonoff
Minitab 17 commands Prepared by Jeffrey S. Simonoff Data entry and manipulation To enter data by hand, click on the Worksheet window, and enter the values in as you would in any spreadsheet. To then save
More informationDAY 52 BOX-AND-WHISKER
DAY 52 BOX-AND-WHISKER VOCABULARY The Median is the middle number of a set of data when the numbers are arranged in numerical order. The Range of a set of data is the difference between the highest and
More informationIntroduction to Geospatial Analysis
Introduction to Geospatial Analysis Introduction to Geospatial Analysis 1 Descriptive Statistics Descriptive statistics. 2 What and Why? Descriptive Statistics Quantitative description of data Why? Allow
More informationChapter 2: Descriptive Statistics
Chapter 2: Descriptive Statistics Student Learning Outcomes By the end of this chapter, you should be able to: Display data graphically and interpret graphs: stemplots, histograms and boxplots. Recognize,
More informationWELCOME! Lecture 3 Thommy Perlinger
Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important
More informationCHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and
CHAPTER-13 Mining Class Comparisons: Discrimination between DifferentClasses: 13.1 Introduction 13.2 Class Comparison Methods and Implementation 13.3 Presentation of Class Comparison Descriptions 13.4
More informationRoad Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary
2. Data preprocessing Road Map Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary 2 Data types Categorical vs. Numerical Scale types
More information2.1: Frequency Distributions and Their Graphs
2.1: Frequency Distributions and Their Graphs Frequency Distribution - way to display data that has many entries - table that shows classes or intervals of data entries and the number of entries in each
More informationBrief Guide on Using SPSS 10.0
Brief Guide on Using SPSS 10.0 (Use student data, 22 cases, studentp.dat in Dr. Chang s Data Directory Page) (Page address: http://www.cis.ysu.edu/~chang/stat/) I. Processing File and Data To open a new
More informationAP Statistics Summer Assignment:
AP Statistics Summer Assignment: Read the following and use the information to help answer your summer assignment questions. You will be responsible for knowing all of the information contained in this
More information10.4 Measures of Central Tendency and Variation
10.4 Measures of Central Tendency and Variation Mode-->The number that occurs most frequently; there can be more than one mode ; if each number appears equally often, then there is no mode at all. (mode
More information10.4 Measures of Central Tendency and Variation
10.4 Measures of Central Tendency and Variation Mode-->The number that occurs most frequently; there can be more than one mode ; if each number appears equally often, then there is no mode at all. (mode
More informationMath 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency
Math 1 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency lowest value + highest value midrange The word average: is very ambiguous and can actually refer to the mean,
More informationGRAPHING BAYOUSIDE CLASSROOM DATA
LUMCON S BAYOUSIDE CLASSROOM GRAPHING BAYOUSIDE CLASSROOM DATA Focus/Overview This activity allows students to answer questions about their environment using data collected during water sampling. Learning
More informationThe basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student
Organizing data Learning Outcome 1. make an array 2. divide the array into class intervals 3. describe the characteristics of a table 4. construct a frequency distribution table 5. constructing a composite
More informationAP Statistics Prerequisite Packet
Types of Data Quantitative (or measurement) Data These are data that take on numerical values that actually represent a measurement such as size, weight, how many, how long, score on a test, etc. For these
More information2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES
EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 Objectives 2.1 What Are the Types of Data? www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative
More informationMATH11400 Statistics Homepage
MATH11400 Statistics 1 2010 11 Homepage http://www.stats.bris.ac.uk/%7emapjg/teach/stats1/ 1.1 A Framework for Statistical Problems Many statistical problems can be described by a simple framework in which
More informationSTP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES
STP 6 ELEMENTARY STATISTICS NOTES PART - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES Chapter covered organizing data into tables, and summarizing data with graphical displays. We will now use
More informationDSC 201: Data Analysis & Visualization
DSC 201: Data Analysis & Visualization Exploratory Data Analysis Dr. David Koop What is Exploratory Data Analysis? "Detective work" to summarize and explore datasets Includes: - Data acquisition and input
More informationData Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski
Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...
More informationData Mining: Exploring Data
Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar But we start with a brief discussion of the Friedman article and the relationship between Data
More informationSpecial Review Section. Copyright 2014 Pearson Education, Inc.
Special Review Section SRS-1--1 Special Review Section Chapter 1: The Where, Why, and How of Data Collection Chapter 2: Graphs, Charts, and Tables Describing Your Data Chapter 3: Describing Data Using
More informationMineração de Dados Aplicada
Data Exploration August, 9 th 2017 DCC ICEx UFMG Summary of the last session Data mining Data mining is an empiricism; It can be seen as a generalization of querying; It lacks a unified theory; It implies
More informationExam Review: Ch. 1-3 Answer Section
Exam Review: Ch. 1-3 Answer Section MDM 4U0 MULTIPLE CHOICE 1. ANS: A Section 1.6 2. ANS: A Section 1.6 3. ANS: A Section 1.7 4. ANS: A Section 1.7 5. ANS: C Section 2.3 6. ANS: B Section 2.3 7. ANS: D
More information1.2. Pictorial and Tabular Methods in Descriptive Statistics
1.2. Pictorial and Tabular Methods in Descriptive Statistics Section Objectives. 1. Stem-and-Leaf displays. 2. Dotplots. 3. Histogram. Types of histogram shapes. Common notation. Sample size n : the number
Data Mining: Exploring Data. Lecture Notes for Chapter 3
Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Look for accompanying R code on the course web site. Topics Exploratory Data Analysis
Exploratory Data Analysis EDA
Exploratory Data Analysis EDA Luc Anselin http://spatial.uchicago.edu 1 from EDA to ESDA dynamic graphics primer on multivariate EDA interpretation and limitations 2 From EDA to ESDA 3 Exploratory Data
AND NUMERICAL SUMMARIES. Chapter 2
EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 What Are the Types of Data? 2.1 Objectives www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative
TMTH 3360 NOTES ON COMMON GRAPHS AND CHARTS
To Describe Data, consider: Symmetry Skewness TMTH 3360 NOTES ON COMMON GRAPHS AND CHARTS Unimodal or bimodal or uniform Extreme values Range of Values and mid-range Most frequently occurring values In
Fathom Dynamic Data™ Version 2 Specifications
Data Sources Fathom Dynamic Data™ Version 2 Specifications Use data from one of the many sample documents that come with Fathom. Enter your own data by typing into a case table. Paste data from other
Chapter 2 Description of samples and populations. 2.1 Introduction.
Chapter 2 Description of samples and populations. 2.1 Introduction. Statistics = science of analyzing data. Information collected (data) is gathered in terms of variables (characteristics of a subject that
MATH& 146 Lesson 8. Section 1.6 Averages and Variation
MATH& 146 Lesson 8 Section 1.6 Averages and Variation 1 Summarizing Data The distribution of a variable is the overall pattern of how often the possible values occur. For numerical variables, three summary
Data Mining By IK Unit 4. Unit 4
Unit 4 Data mining can be classified into two categories 1) Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms 2) Predictive mining:
Excel 2010 with XLSTAT
Excel 2010 with XLSTAT Jennifer Lewis Priestley, Ph.D. Introduction to Excel 2010 with XLSTAT The layout for Excel 2010 is slightly different from the layout for Excel 2007. However, with
CHAPTER 3: Data Description
CHAPTER 3: Data Description You've tabulated and made pretty pictures. Now what numbers do you use to summarize your data? Ch3: Data Description Santorico Page 68 You'll find a link on our website to a
Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality
Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing
Understanding and Comparing Distributions. Chapter 4
Understanding and Comparing Distributions Chapter 4 Objectives: Boxplot Calculate Outliers Comparing Distributions Timeplot The Big Picture We can answer much more interesting questions about variables
Regression III: Advanced Methods
Lecture 3: Distributions Regression III: Advanced Methods William G. Jacoby Michigan State University Goals of the lecture Examine data in graphical form Graphs for looking at univariate distributions
Data Exploration & Visualization
Introduction to Data Mining Data Exploration & Visualization CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Data Exploration Yale - Fall 2016 1 / 46 Outline
Chapter 5. Understanding and Comparing Distributions. Copyright 2012, 2008, 2005 Pearson Education, Inc.
Chapter 5 Understanding and Comparing Distributions The Big Picture We can answer much more interesting questions about variables when we compare distributions for different groups. Below is a histogram
Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables
Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables
Data analysis using Microsoft Excel
Introduction to Statistics Statistics may be defined as the science of collection, organization, presentation, analysis, and interpretation of numerical data from the logical analysis. 1. Collection of Data
8 Organizing and Displaying
CHAPTER 8 Organizing and Displaying Data for Comparison Chapter Outline 8.1 BASIC GRAPH TYPES 8.2 DOUBLE LINE GRAPHS 8.3 TWO-SIDED STEM-AND-LEAF PLOTS 8.4 DOUBLE BAR GRAPHS 8.5 DOUBLE BOX-AND-WHISKER PLOTS
Data Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
Statistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or me, I will answer promptly.
Statistical Methods Instructor: Lingsong Zhang 1 Issues before Class Statistical Methods Lingsong Zhang Office: Math 544 Email: lingsong@purdue.edu Phone: 765-494-7913 Office Hour: Monday 1:00 pm - 2:00
VCEasy VISUAL FURTHER MATHS. Overview
VCEasy VISUAL FURTHER MATHS Overview This booklet is a visual overview of the knowledge required for the VCE Year 12 Further Maths examination. This booklet does not replace any existing resources that
In Minitab interface has two windows named Session window and Worksheet window.
Minitab Minitab is a statistics package. It was developed at the Pennsylvania State University by researchers Barbara F. Ryan, Thomas A. Ryan, Jr., and Brian L. Joiner in 1972. Minitab began as a light
Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 1st, 2018
Data Mining CS57300 Purdue University Bruno Ribeiro February 1st, 2018 1 Exploratory Data Analysis & Feature Construction How to explore a dataset Understanding the variables (values, ranges, and empirical
Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242
Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242 Creation & Description of a Data Set * 4 Levels of Measurement * Nominal, ordinal, interval, ratio * Variable Types
Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.
Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied
Middle Years Data Analysis Display Methods
Middle Years Data Analysis Display Methods Double Bar Graph A double bar graph is an extension of a single bar graph. Any bar graph involves categories and counts of the number of people or things (frequency)
Section 9: One Variable Statistics
The following Mathematics Florida Standards will be covered in this section: MAFS.912.S-ID.1.1 MAFS.912.S-ID.1.2 MAFS.912.S-ID.1.3 Represent data with plots on the real number line (dot plots, histograms,
Making Science Graphs and Interpreting Data
Making Science Graphs and Interpreting Data Eye Opener: 5 mins What do you see? What do you think? Look up terms you don't know What do Graphs Tell You? A graph is a way of expressing a relationship between
Week 7 Picturing Network. Vahe and Bethany
Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups
Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska
Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept
What are we working with? Data Abstractions. Week 4 Lecture A IAT 814 Lyn Bartram
What are we working with? Data Abstractions Week 4 Lecture A IAT 814 Lyn Bartram Munzner's What-Why-How What are we working with? DATA abstractions, statistical methods Why are we doing it? Task abstractions
Data Mining: Concepts and Techniques. (3rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing
Data Mining: Concepts and Techniques (3rd ed.) Chapter 3 1 Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks in Data Preprocessing Data Cleaning Data Integration Data
CHAPTER 2 DESCRIPTIVE STATISTICS
CHAPTER 2 DESCRIPTIVE STATISTICS 1. Stem-and-Leaf Graphs, Line Graphs, and Bar Graphs The distribution of data is how the data is spread or distributed over the range of the data values. This is one of
Chapter 3: Data Mining:
Chapter 3: Data Mining: 3.1 What is Data Mining? Data Mining is the process of automatically discovering useful information in large repository. Why do we need Data mining? Conventional database systems
At the end of the chapter, you will learn to: Present data in textual form. Construct different types of table and graphs
DATA PRESENTATION At the end of the chapter, you will learn to: Present data in textual form Construct different types of table and graphs Identify the characteristics of a good table and graph Identify
To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiply by 1/n): 1 n. = 29 years.
3: Summary Statistics Notation Consider these 10 ages (in years): 1 4 5 11 30 50 8 7 4 5 The symbol n represents the sample size (n = 10). The capital letter X denotes the variable. x i represents the