Data Exploration & Visualization

1 Introduction to Data Mining Data Exploration & Visualization CPSC/AMTH 445a/545a Guy Wolf Yale University Fall 2016 CPSC 445 (Guy Wolf) Data Exploration Yale - Fall / 46

2 Outline
1. Tabular data
   - Observations/Data-Points vs. Features/Attributes
   - Qualitative vs. Quantitative attributes
   - Qualitative: Nominal vs. Ordinal
   - Quantitative: Interval vs. Ratio
2. Summary statistics
   - Frequency, mode, & percentiles
   - Mean & median
   - Range & variance
   - Covariance & correlation
   - Data quality
3. Visualizations
   - Box plots
   - Histograms
   - Star plots

3 Outline (cont.)
   (Visualizations, continued)
   - Parallel coordinate plots
   - Scatter plots
   - Quiver plots
4. Transactional data
   - Term matrix
   - Text documents
5. Structured signals (e.g., audio and EEG)
   - Fourier & wavelets
   - Spectrogram & scalogram
6. Multidimensional signals (e.g., images and videos)
   - Visualization with contour plots
7. Nonparametric (affinity-/distance-based) representations
   - Graph data
   - Visualization with matrix plots

4 What is data?

5 What is data? Experimental vs. observational data
Experimental data: Data collected from strictly controlled/designed experiments, with efforts made to ensure statistical validity. Examples: medical clinical trials, election polls.
Observational data: Data collected from real-world settings without control over the captured underlying phenomena. It is easier to collect and obtain, but results and conclusions from such data may be biased or inconclusive.
Almost all data used in data mining is observational data.

6 Tabular Data

7 Tabular data
Organizing data in a table of observations-by-features is considered the most convenient and standard format for data analysis.
Example: Consider the following procedure:
1. From each machine, collect 3 temperature measurements (MOBO, CPU, GPU), 4 software parameters (CPU, RAM, HDD, # processes), and 2 power consumption values (MOBO, GPU).
2. Attach unique identifiers of the machine, OS, and hardware manufacturer.
3. Every second, store a record with these values from every machine in the system.
We end up with hundreds of thousands of records, each containing 12 fields.

8 Tabular data: Observations/Data-Points vs. Features/Attributes
Columns are features/attributes/properties/fields; rows are observations/objects/datapoints/samples/records.
Timestamp     | OS  | Temp | CPU | # proc
/1/16 1:00 AM | LNX | 45 C | 65% |

9 Tabular data: Types of features/attributes
It is important to recognize the types of values each feature/attribute takes in order to understand which operations make sense for it.
Examples:
- Can we compute an average eye color?
- How do we compute the difference between phone numbers?
- Can we say today is twice as hot/cold as yesterday?
This is similar to problems like 6 apples / 4 people = 1.5 apples per person, but 10 people / 4 car seats = 3 cars.

10 Tabular data: Qualitative vs. Quantitative attributes
Attribute values can be split into two types:
Qualitative attributes: Attributes that take values from a (finite) set of categories are called categorical or qualitative attributes. In some sense, they describe an object/observation, rather than measure its properties.
Quantitative attributes: Attributes that represent quantities are called numerical or quantitative attributes. They provide concrete quantifiable measurements of an object/observation.

11 Tabular data: Qualitative: Nominal vs. Ordinal
Qualitative attributes can be split further into two types:
Nominal attributes (examples: zip codes, eye color, operating system, gender): Values of such attributes just specify names, without any particular order or relation between them (except for = and ≠). Binary attributes are nominal attributes with only two values (Yes/No or 0/1). They can be symmetric or asymmetric, based on whether or not their values are equally informative.
Ordinal attributes (examples: ratings, grades, street/avenue numbers): Values of such attributes have some order, even though they don't specify an exact quantity.

12 Tabular data: Quantitative: Interval vs. Ratio
Quantitative attributes can also be split into two types:
Interval attributes (examples: calendar dates, azimuth direction, Fahrenheit temperatures): Such attributes represent quantities with meaningful differences (or fixed intervals) between their values, but no multiplicative relations.
Ratio attributes (examples: mass, length, distance, currency, age, electrical current): Such attributes represent quantities that have meaningful ratios between their values. Unlike interval attributes, ratio ones usually have an absolute zero.
We can also split quantities into discrete and continuous ones. All qualitative attributes are considered discrete.

13 Tabular data: Summary of attribute types
The types of attributes can be regarded via the operations that can be applied to them:
- Comparison (= and ≠): every type
- Ordering (> and <): every type except nominal
- Differences (−) and addition (+): only quantitative
- Division (/) and multiplication (×, ·): only ratio
Other operations (e.g., mean, median, correlation) may also be inapplicable for some types while applicable to others.
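
The summary above can be sketched as a small lookup table. This is only an illustration: the names `ALLOWED_OPS` and `supports` are hypothetical and not part of the lecture, and the operator spellings (`!=`, `*`) are ASCII stand-ins for the symbols on the slide.

```python
# Hypothetical lookup table encoding which operations each attribute type supports.
ALLOWED_OPS = {
    "nominal": {"=", "!="},
    "ordinal": {"=", "!=", "<", ">"},
    "interval": {"=", "!=", "<", ">", "+", "-"},
    "ratio": {"=", "!=", "<", ">", "+", "-", "*", "/"},
}


def supports(attr_type: str, op: str) -> bool:
    """Return True if `op` is meaningful for attributes of type `attr_type`."""
    return op in ALLOWED_OPS[attr_type]


# Example: ratios make sense for mass (ratio) but not for calendar dates (interval).
mass_ok = supports("ratio", "/")
date_ok = supports("interval", "/")
```

A check like this could guard a preprocessing step against, say, averaging a nominal column.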

14 Tabular data: Technical formats
Tabular data can be stored, collected, or given in several standard formats, such as:
- Comma separated file (CSV)
- Flat file or delimited text file (e.g., space or tab delimited)
- XML or other log files
- Proprietary formats (e.g., FCS for biological data or MAT files for Matlab data)
- Database tables
There are several techniques and standard designs to collect and store big data in databases. Data warehouse, ETL (extract-transform-load), and OLAP (Online Analytical Processing) are some related terms encountered frequently in the IT industry.

15 Tabular data: Data warehouse: star and snowflake schemas (figures: star schema; snowflake schema)

17 Summary statistics
The raw representation of the data is often not convenient for initial exploration and understanding of the data. How do we get general insights into the data and its attributes as a whole?
Summary statistics: Properties that summarize global information, such as central tendency, spread, and variations of observations and features.
These statistics provide an important first step in data analysis, and most of them are not difficult to compute in linear time w.r.t. the size of the data.

21 Summary statistics: Frequency, mode, & percentiles
Frequency: The portion (e.g., percentage) of the observations with each specific value of a categorical or discrete attribute.
Mode: The most frequent value of an attribute in the data.
Percentiles: The $p$-th percentile (with $0 \le p \le 100$) of an attribute is a value $P_p$ such that $p\%$ of the observed values of this attribute are less than $P_p$. We typically take $P_p$ as one of the observed values of the attribute.
Alternatives: quartile $Q_i$ ($i = 1, 2, 3$), quantile, etc.
Visual examples: stem-and-leaf displays; quantile & Q-Q plots.
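
The three definitions above can be sketched with the Python standard library. The helper names and sample data are hypothetical. The percentile uses the nearest-rank convention, which is an assumption: it is one of several common conventions, chosen here because it always returns an observed value.

```python
import math
from collections import Counter


def frequencies(values):
    """Portion of observations taking each distinct value."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def mode(values):
    """Most frequent value (ties are broken by first occurrence)."""
    return Counter(values).most_common(1)[0][0]


def percentile(values, p):
    """Nearest-rank p-th percentile; always returns an observed value."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample data in the spirit of the machine-monitoring example.
os_column = ["LNX", "WIN", "LNX", "MAC", "LNX", "WIN"]
temps = [45, 47, 52, 61, 70]

os_freq = frequencies(os_column)
os_mode = mode(os_column)
median_temp = percentile(temps, 50)
```

Frequency and mode apply to any attribute type, while percentiles need at least an ordering.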

26 Summary statistics: Mean & median
Mean: The mean (or average) $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the most common way to measure the central location or value of data points. However, it is very sensitive to outliers. A trimmed mean is more robust to outliers by disregarding extreme values. A weighted mean also takes into account weights for each observation.
Median: The median of an attribute is a value such that half of the observed values are above it and half are below it. It is the middle value for an odd number of observations, or the average (when it makes sense) between the two middle numbers for an even number of observations. The median corresponds to $P_{50}$ and $Q_2$.
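
A minimal sketch of these centrality measures shows how a single outlier moves the mean but not the trimmed mean or the median. The helper names and the sample temperatures are hypothetical, and the trimming rule (drop the `k` smallest and `k` largest values) is one common choice rather than the only one.

```python
import statistics


def mean(xs):
    """Arithmetic mean: (1/n) * sum of the values."""
    return sum(xs) / len(xs)


def trimmed_mean(xs, k):
    """Mean after dropping the k smallest and k largest values."""
    if 2 * k >= len(xs):
        raise ValueError("k is too large for this sample")
    kept = sorted(xs)[k:len(xs) - k]
    return sum(kept) / len(kept)


def weighted_mean(xs, ws):
    """Mean in which observation x_i counts with weight w_i."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


# Hypothetical temperatures with one outlier (98).
temps = [44, 45, 46, 47, 98]

plain = mean(temps)               # pulled upward by the outlier
robust = trimmed_mean(temps, 1)   # outlier (and the minimum) discarded
middle = statistics.median(temps)
```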

27 Summary statistics: Centrality and skewed data
Relations between three measures of centrality (mean, median, and mode) can indicate symmetric or skewed distributions of attributes (figure panels: symmetric, positively skewed, negatively skewed).

30 Summary statistics: Range & variance
Range: The range is the difference between the max and min observed values of an attribute.
Variance: The variance $s_x^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ and the standard deviation (STD) $s_x = \sqrt{s_x^2}$ are the most common ways to measure the spread of values. However, like the mean, they are sensitive to outliers.
Other spread measures include:
- average absolute deviation: the average of $|x_i - \bar{x}|$
- median absolute deviation: the median of $|x_i - \bar{x}|$
- interquartile range: the difference $x_{75\%} - x_{25\%}$
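
The spread measures above can be sketched as follows. The function names and sample data are hypothetical. Two assumptions are worth flagging: deviations are centered on the mean, as in the formulas above (many references center the median absolute deviation on the median instead), and the quartiles come from `statistics.quantiles` with its default method, which is only one of several quartile conventions.

```python
import math
import statistics


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Variance with the 1/n normalization used above."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def std(xs):
    return math.sqrt(variance(xs))


def value_range(xs):
    return max(xs) - min(xs)


def average_abs_deviation(xs):
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


def median_abs_deviation(xs):
    # Centered on the mean to follow the formula above (an assumption).
    m = mean(xs)
    return statistics.median(abs(x - m) for x in xs)


def interquartile_range(xs):
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
```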

33 Summary statistics: Covariance & correlation
Covariance: Measures the degree to which attributes vary together, and is computed by $\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$. This value depends on the magnitude/spread of the attribute values.
For $k$ attributes, these values form a $k \times k$ covariance matrix, with the variances $s_x^2 = \mathrm{cov}(x, x)$ on its diagonal.

34 Summary statistics: Covariance & correlation
Correlation: A value between -1 and 1 that indicates how strongly two attributes are (linearly) related.
Pearson correlation: $\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{s_x s_y}$. Notice that it is independent of magnitudes/spreads, and that $\mathrm{corr}(x, x) = 1$.
Notice that Pearson correlation is the covariance, or dot product, between standardized attributes:
$\mathrm{corr}(x, y) = s_x^{-1} s_y^{-1} \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
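
The identity above can be checked numerically. The sketch below computes Pearson correlation in two ways: directly from the covariance, and as the scaled dot product of standardized attributes. All function names and data are hypothetical, and the 1/n normalization follows the formulas above.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Covariance with 1/n normalization."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def std(xs):
    return math.sqrt(cov(xs, xs))


def corr(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return cov(xs, ys) / (std(xs) * std(ys))


def standardize(xs):
    """Shift to zero mean and scale to unit standard deviation."""
    m, s = mean(xs), std(xs)
    return [(x - m) / s for x in xs]


def corr_standardized(xs, ys):
    """Same quantity as (1/n) times the dot product of standardized attributes."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.5, 3.0, 7.0, 6.0]
```

Because standardization removes the units, rescaling `x` or `y` (e.g., Celsius to Fahrenheit) leaves the result unchanged.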

35 Summary statistics: Misleading example: pirates & global warming (figure taken from Wikipedia)

36 Summary statistics: Data quality
Summary statistics enable identification of various data quality issues, such as:
Precision: The closeness of repeated measurements to one another.
Bias: A systematic variation of measurements from the quantity being measured.
Accuracy: The closeness of measurements to the true value of the quantity being measured.
Other issues include missing values, outliers, and duplicate values.

37 Visualizations: Why do we need visualizations?
While summary statistics provide useful information about the data, they can be overwhelming and hard to track when many attributes are considered.
Visualization: Conversion of data into visual elements that express characteristics, relationships, and information about data points and attributes.
Visualizations provide graphic representations that enable us to draw insights at a single glance.

38 Visualizations: Why do we need visualizations? Example (figure)

39 Visualizations: Why do we need visualizations? Example (TreeMap; figure taken from Wikipedia)

41 Visualizations: What constitutes a good visualization?
No really good answer... but there are some general guidelines:
ACCENT principles
- Apprehension: we can correctly perceive relations among variables.
- Clarity: visually distinguish important relations and elements.
- Consistency: comparing graphical elements/displays shows faithful (dis)similarities in the data.
- Efficiency: simplify complex relations and patterns in the visualization.
- Necessity: only include necessary graphical elements, with no extraneous elements.
- Truthfulness: true values (absolute or relative) can be determined from graphical elements.

42 Visualizations: Box plots
Box plots (invented by J. Tukey) show a five-number summary of the distribution of attribute values, based on percentiles. From top to bottom, the plot marks: outliers, 90th percentile, 75th percentile, median, 25th percentile, 10th percentile.
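
The numbers behind such a box plot can be sketched as below. This follows the percentile-based variant described above (whiskers at the 10th and 90th percentiles, with points beyond them treated as outliers). The function names are hypothetical, and the nearest-rank percentile is an assumed convention. Other box plot variants place the whiskers differently.

```python
import math


def percentile(values, p):
    """Nearest-rank p-th percentile; always returns an observed value."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def box_plot_summary(values):
    """Five percentile markers plus the points lying outside the whiskers."""
    p10, q1, median, q3, p90 = (percentile(values, p) for p in (10, 25, 50, 75, 90))
    outliers = [v for v in values if v < p10 or v > p90]
    return {
        "p10": p10,
        "q1": q1,
        "median": median,
        "q3": q3,
        "p90": p90,
        "outliers": outliers,
    }


summary = box_plot_summary(list(range(1, 21)))  # hypothetical data: 1..20
```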

43 Visualizations: Histograms (figures taken from: Pierchala, C. The choice of age groupings may affect the quality of tabular presentations)

47 Visualizations: Star plots (figures)

49 Visualizations: Parallel coordinate plots
Notice that attributes in this case do not have a particular order.

50 Visualizations: Scatter plots (figure)

51 Visualizations: Quiver plots (figure)

52 Non-tabular Data

53 Transactional data
In transactional data, each observation is a transaction that contains a collection of items or a sequence of events.
Example (market basket data):
- Customer #1: {milk, bread, butter}
- Customer #2: {orange juice, milk}
- Customer #3: {orange juice, peanut butter, jelly, bread}
- ...
Transaction items can also contain numerical attributes, such as the number of purchased items (e.g., 3 boxes of cookies) or their price. When sequences (e.g., events, actions, or genes) are considered, temporal/order information may also be included.

54 Transactional data: Term matrix
In some cases, transactional data can be converted to tabular form by considering a term matrix (a.k.a. bag of words/features techniques).
Example: a table with one row per customer (Customer #...) and the columns CustomerID, milk, bread, butter, O.J., cheese, P.B., jelly.
This representation loses sequential information, and applying it to continuous values requires a discretization step.
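
As an illustration, a term matrix can be built from the three market baskets listed earlier. The `term_matrix` helper is hypothetical, and so is the choice to store item counts, with columns in alphabetical order of the items that actually occur. The table on the slide may use a different column order or binary indicators.

```python
def term_matrix(transactions):
    """Return (terms, rows): one count vector per transaction, one column per term."""
    terms = sorted({item for items in transactions.values() for item in items})
    rows = {
        tid: [items.count(term) for term in terms]
        for tid, items in transactions.items()
    }
    return terms, rows


# The three baskets from the market basket example.
baskets = {
    "Customer #1": ["milk", "bread", "butter"],
    "Customer #2": ["orange juice", "milk"],
    "Customer #3": ["orange juice", "peanut butter", "jelly", "bread"],
}

terms, rows = term_matrix(baskets)
```

Each row is now an ordinary tabular observation, so the summary statistics discussed earlier apply, but the order of items within a basket is no longer recoverable.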

57 Transactional data: Text documents
Text documents can be considered as transactional data in one of two ways:
1. Each document can be considered as a big transaction containing words. Bag of words techniques ignore grammatical structures and represent a document as a histogram of word occurrences. Similar approaches can also be applied to images, questionnaires, etc., with an appropriate dictionary-building clustering step.
2. A document can be considered as a transactional dataset on its own, which contains word contexts (e.g., with n-grams or skip-grams). Word2vec techniques use this approach to associate numerical coordinates (typically in $\mathbb{R}^{300}$) to words based on the contexts in which they appear.
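
Both views can be sketched in a few lines. The tokenizer below (lowercasing and splitting on whitespace) is a deliberately naive assumption, and the function names are hypothetical. Real pipelines usually also handle punctuation, stop words, and stemming.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def bag_of_words(text):
    """View 1: the document as a histogram of word occurrences."""
    return Counter(tokenize(text))


def ngrams(tokens, n):
    """View 2: the document as a collection of word contexts (n-grams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


document = "the cat saw the dog"  # hypothetical example sentence
histogram = bag_of_words(document)
bigrams = ngrams(tokenize(document), 2)
```

The histogram discards word order entirely, whereas the n-grams retain local context.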

58 Transactional data: Text documents
Example (term analysis of Donald Trump's tweets): most frequent words, iPhone vs. Android (figure taken from varianceexplained.org/r/trump-tweets/)

59 Transactional data: Text documents (figure taken from github.com/aubry74/visual-word2vec/)

60 Structured signals
Structured signals have well-known relations between their attributes. They are typically numerical, with temporal or spatial ordering.
Examples:
- Audio recordings
- EEG signals
- Heart rate
- Room temperatures
Each data point is then a signal collected over time (or space), and can be analyzed with signal processing tools.

61 Structured signals: Fourier & power spectrum (figures: time series; power spectrum)

64 Structured signals: STFT & wavelets (figures: STFT; wavelets; Haar wavelets with lowpass, scale 1, and scale 2 components)

67 Structured signals: Spectrogram & scalogram (figures: spectrogram; scalogram)

69 Multidimensional signals
Multidimensional signals are similar, but have several coordinates that specify the relations between their attributes.
Examples:
- Grayscale images have two spatial coordinates that determine pixel positions.
- Videos have two spatial coordinates and one temporal coordinate that determine pixel positions.
- Geographic data has two or three coordinates determining longitude, latitude, and elevation.
- Colored and hyperspectral images have two spatial coordinates and one spectral/channel coordinate.
In general, many signal processing approaches can be extended from one-dimensional signals to multidimensional ones.

70 Multidimensional signals: Two-dimensional wavelets (figures)

74 Multidimensional signals: Visualization with contour plots (figures)

76 Nonparametric representations
In some cases, the important information in the data is the relations between data points, rather than their attributes.
Examples:
- Spatial locations and trajectories
- Phone calls and correspondences
- Gene interactions and cell progressions
In these cases, an affinity matrix between data points, based on similarities or distances, can be used for analysis. Essentially, each data point is represented by its relations to other data points rather than by its own attributes.
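
One possible construction of such an affinity matrix is sketched below. It assumes Euclidean distances and a Gaussian kernel with a hypothetical bandwidth parameter `sigma`. Neither choice is prescribed by the lecture, and other distances or kernels could be substituted.

```python
import math


def pairwise_distances(points):
    """Matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def gaussian_affinity(points, sigma=1.0):
    """Affinity exp(-d^2 / (2 * sigma^2)): 1 for identical points, near 0 for distant ones."""
    distances = pairwise_distances(points)
    return [[math.exp(-d ** 2 / (2 * sigma ** 2)) for d in row] for row in distances]


# Hypothetical spatial locations.
locations = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]

D = pairwise_distances(locations)
W = gaussian_affinity(locations, sigma=2.0)
```

Row `i` of `W` then serves as the representation of data point `i`: its relations to every other point.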

79 Nonparametric representations: Graph data
Graphs can be used to formalize relations in data in two ways:
1. Relationships between attributes can form graphs (e.g., molecule data, such as benzene, C6H6). In this case each data point is a graph on its own, and this is a more complicated example of structured data.

82 Nonparametric representations: Graph data (cont.)
2. The graph is considered as the dataset, and each node is a data point (e.g., social networks and web-reference data). In this case, an adjacency matrix can form an affinity matrix. Conversely, affinity matrices can form adjacency matrices, so nonparametric data is often considered as graph data.
Spectral graph methods (e.g., SVD of the graph Laplacian) can be used to associate coordinates to data points in the second case, for visualization with scatter plots and further analysis.
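
For the second case, the matrices involved can be sketched as follows. The code builds the adjacency matrix of a small undirected graph and the unnormalized graph Laplacian L = D - A, where D is the diagonal degree matrix. The function names and the example graph are hypothetical, and the unnormalized Laplacian is an assumed choice. The spectral step itself (an eigen- or singular value decomposition of L) is left out, since it would normally rely on a numerical library.

```python
def adjacency_matrix(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph on nodes 0..n-1."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        a[j][i] = 1
    return a


def graph_laplacian(a):
    """Unnormalized Laplacian L = D - A, with node degrees on the diagonal of D."""
    n = len(a)
    degrees = [sum(row) for row in a]
    return [
        [(degrees[i] if i == j else 0) - a[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical path graph: 0 - 1 - 2.
A = adjacency_matrix(3, [(0, 1), (1, 2)])
L = graph_laplacian(A)
```

Every row of `L` sums to zero, so the constant vector is always in its null space. Spectral methods typically take coordinates from the next few eigenvectors.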

83 Nonparametric representations: Visualization with matrix plots (figure)

84 Summary
We considered the following data and attribute types, and briefly showed how to handle, process, and visualize them:
Types of attributes:
- Nominal
- Ordinal
- Interval
- Ratio
Types of data:
- Tabular data
- Transactional & text data
- Structured (1D, 2D, & more) data
- Nonparametric & graph data
Exploratory data analysis is crucial for obtaining intelligible results, e.g., by identifying valid applicable operations on the data and possibly transforming it to a representation more amenable to analysis.
Other preprocessing steps include normalization/standardization, sampling, discretization, aggregation, and dimensionality reduction.

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Data Mining: Exploring Data. Lecture Notes for Chapter 3 Data Mining: Exploring Data Lecture Notes for Chapter 3 1 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.

More information

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Data Exploration Chapter Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 What is data exploration?

More information

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking

More information

Chapter 3: Data Mining:

Chapter 3: Data Mining: Chapter 3: Data Mining: 3.1 What is Data Mining? Data Mining is the process of automatically discovering useful information in large repository. Why do we need Data mining? Conventional database systems

More information

STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring Lecture 5 Tuesday, Feb 1 STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information
