Data Exploration & Visualization
1 Introduction to Data Mining Data Exploration & Visualization CPSC/AMTH 445a/545a Guy Wolf Yale University Fall 2016 CPSC 445 (Guy Wolf) Data Exploration Yale - Fall / 46
2-3 Outline
1 Tabular data
  - Observations/Data-Points vs. Features/Attributes
  - Qualitative vs. Quantitative attributes
  - Qualitative: Nominal vs. Ordinal
  - Quantitative: Interval vs. Ratio
2 Summary statistics
  - Frequency, mode, & percentiles
  - Mean & median
  - Range & variance
  - Covariance & correlation
  - Data quality
3 Visualizations
  - Box plots
  - Histograms
  - Star plots
  - Parallel coordinate plots
  - Scatter plots
  - Quiver plots
4 Transactional data
  - Term matrix
  - Text documents
5 Structured signals (e.g., audio and EEG)
  - Fourier & wavelets
  - Spectrogram & scalogram
6 Multidimensional signals (e.g., images and videos)
  - Visualization with contour plots
7 Nonparametric (affinity-/distance-based) representations
  - Graph data
  - Visualization with matrix plots
4 What is data?
5 What is data? Experimental vs. observational data
Experimental data: Data collected from strictly controlled/designed experiments, with efforts made to ensure statistical validity. Examples: medical clinical trials, election polls.
Observational data: Data collected from real-world settings without control over the captured underlying phenomena. It is easier to collect and obtain, but results and conclusions drawn from such data may be biased or inconclusive.
Almost all data used in data mining is observational data.
6 Tabular Data
7 Tabular data
Organizing data in a table of observations-by-features is considered the most convenient and standard format for data analysis.
Example: Consider the following procedure:
1 From each machine, collect 3 temperature measurements (MOBO, CPU, GPU), 4 software parameters (CPU, RAM, HDD, # processes), and 2 power consumption values (MOBO, GPU).
2 Attach unique identifiers of the machine, OS, and hardware manufacturer.
3 Every second, store a record with these values from every machine in the system.
We end up with hundreds of thousands of records, each containing 12 fields.
8 Tabular data: Observations/Data-Points vs. Features/Attributes
Columns are features/attributes/properties/fields; rows are observations/objects/datapoints/samples/records.
Timestamp | OS | Temp | CPU | # proc
/1/16 1:00 AM | LNX | 45 °C | 65% |
9 Tabular data: Types of features/attributes
It is important to recognize the types of values each feature/attribute takes in order to understand which operations make sense for it.
Examples:
- Can we compute an average eye color?
- How do we compute the difference between phone numbers?
- Can we say today is twice as hot/cold as yesterday?
This is similar to problems like 6 apples / 4 people = 1.5 apples per person, but 10 people / 4 car seats = 3 cars.
10 Tabular data: Qualitative vs. Quantitative attributes
Attribute values can be split into two types:
Qualitative attributes: Attributes that take values from a (finite) set of categories are called categorical or qualitative attributes. In some sense, they describe an object/observation rather than measure its properties.
Quantitative attributes: Attributes that represent quantities are called numerical or quantitative attributes. They provide concrete, quantifiable measurements of an object/observation.
11 Tabular data: Qualitative: Nominal vs. Ordinal
Qualitative attributes can be split further into two types:
Nominal attributes (examples: zip codes, eye color, operating system, gender): Values of such attributes just specify names, without any particular order or relation between them (except for = and ≠). Binary attributes are nominal attributes with only two values (Yes/No or 0/1). They can be symmetric or asymmetric, based on whether or not their values are equally informative.
Ordinal attributes (examples: ratings, grades, street/avenue numbers): Values of such attributes have some order, even though they don't specify an exact quantity.
12 Tabular data: Quantitative: Interval vs. Ratio
Quantitative attributes can also be split into two types:
Interval attributes (examples: calendar dates, azimuth direction, Fahrenheit temperatures): Such attributes represent quantities with meaningful differences (or fixed intervals) between their values, but no multiplicative relations.
Ratio attributes (examples: mass, length, distance, currency, age, electrical current): Such attributes represent quantities that have meaningful ratios between their values. Unlike interval attributes, ratio ones usually have an absolute zero.
We can also split quantities into discrete and continuous ones. All qualitative attributes are considered discrete.
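The interval/ratio distinction can be checked numerically. The sketch below is only an illustration: the two temperatures are made-up values, and `fahrenheit_to_kelvin` is a helper introduced here, not something from the slides. It shows that a ratio computed on the Fahrenheit scale changes once the same temperatures are expressed on a scale with an absolute zero.

```python
def fahrenheit_to_kelvin(temp_f):
    """Convert a Fahrenheit temperature (interval scale) to Kelvin (ratio scale)."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


# Hypothetical readings, chosen only for illustration.
yesterday_f, today_f = 40.0, 80.0

# Differences are meaningful on an interval scale.
difference_f = today_f - yesterday_f

# The Fahrenheit ratio depends on the arbitrary zero point of the scale...
ratio_f = today_f / yesterday_f

# ...whereas Kelvin has an absolute zero, so its ratio is the meaningful one.
ratio_k = fahrenheit_to_kelvin(today_f) / fahrenheit_to_kelvin(yesterday_f)

print(f"difference: {difference_f} F, ratio in F: {ratio_f:.2f}, ratio in K: {ratio_k:.2f}")
```

The Fahrenheit ratio suggests "twice as hot", while the Kelvin ratio is only slightly above 1, which is why multiplication and division are reserved for ratio attributes.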
13 Tabular data: Summary of attribute types
The types of attributes can be regarded via the operations that can be applied to them:
- Comparison (= and ≠): every type
- Ordering (> and <): every type except nominal
- Differences (−) and addition (+): only quantitative
- Division (/) and multiplication (×, ·): only ratio
Other operations (e.g., mean, median, correlation) may also be inapplicable for some types while applicable to others.
14 Tabular data: Technical formats
Tabular data can be stored, collected, or given in several standard formats, such as:
- Comma-separated values file (CSV)
- Flat file or delimited text file (e.g., space or tab delimited)
- XML or other log files
- Proprietary formats (e.g., FCS for biological data or MAT files for Matlab data)
- Database tables
There are several techniques and standard designs to collect and store big data in databases. Data warehouse, ETL (extract-transform-load), and OLAP (Online Analytical Processing) are some related terms encountered frequently in the IT industry.
15-16 Tabular data: Data warehouse: star and snowflake schemas
[Figures: star schema; snowflake schema]
17 Summary statistics
The raw representation of the data is often not convenient for initial exploration and understanding. How do we get general insights into the data and its attributes as a whole?
Summary statistics: Properties that summarize global information, such as central tendency, spread, and variations of observations and features.
These statistics provide an important first step in data analysis, and most of them can be computed in linear time w.r.t. the size of the data.
18-21 Summary statistics: Frequency, mode, & percentiles
Frequency: The portion (e.g., percentage) of the observations with each specific value of a categorical or discrete attribute.
Mode: The most frequent value of an attribute in the data.
Percentiles: The p-th percentile (with $0 \le p \le 100$) of an attribute is a value $P_p$ such that p% of the observed values of this attribute are less than $P_p$. We typically take $P_p$ as one of the observed values of the attribute. Alternatives: quartiles $Q_i$ ($i = 1, 2, 3$), quantiles, etc.
Visual examples: stem-and-leaf displays; quantile & Q-Q plots.
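A minimal sketch of these three statistics, using only the Python standard library. The sample values are made up, and the percentile uses the nearest-rank convention (the smallest observed value with at least p% of the data at or below it). This is one of several common conventions and is an assumption here, not necessarily the one intended in the lecture.

```python
import math
from collections import Counter


def frequencies(values):
    """Return the fraction of observations taking each distinct value."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def mode(values):
    """Return the most frequent value (the first one encountered wins ties)."""
    return Counter(values).most_common(1)[0][0]


def percentile(values, p):
    """Nearest-rank percentile: always returns one of the observed values."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical columns, loosely modeled on the machine-monitoring example.
os_column = ["LNX", "WIN", "LNX", "MAC", "LNX"]
temperatures = [45, 50, 47, 61, 52, 49, 58, 44]

print(frequencies(os_column))
print(mode(os_column))
print([percentile(temperatures, p) for p in (25, 50, 75)])
```

Frequency and mode apply to any attribute type, while percentiles need at least an ordering, so they are not defined for nominal attributes.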
22-24 Summary statistics: Frequency, mode, & percentiles
[Figures]
25-26 Summary statistics: Mean & median
Mean: The mean (or average) $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the most common way to measure the central location or value of data points. However, it is very sensitive to outliers. A trimmed mean is more robust to outliers because it disregards extreme values. A weighted mean also takes into account weights for each observation.
Median: The median of an attribute is a value such that half of the observed values are above it and half are below it. It is the middle value for an odd number of observations, or the average (when it makes sense) of the two middle values for an even number of observations. The median corresponds to $P_{50}$ and $Q_2$.
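The sensitivity of the mean to outliers is easy to see on a small example. The following sketch uses made-up data; the `trimmed_mean` helper drops a fixed number `k` of values from each end, which is one simple way to define trimming and is an assumption made for illustration.

```python
def mean(values):
    """Arithmetic mean: (1/n) * sum of x_i."""
    return sum(values) / len(values)


def trimmed_mean(values, k):
    """Mean after discarding the k smallest and k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


def weighted_mean(values, weights):
    """Mean where each observation contributes in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def median(values):
    """Middle value, or the average of the two middle values for even n."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical sample with a single extreme outlier.
sample = [2, 3, 4, 5, 100]

print(mean(sample))             # pulled far toward the outlier
print(trimmed_mean(sample, 1))  # outlier discarded
print(median(sample))           # unaffected by the outlier's magnitude
```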
27 Summary statistics: Centrality and skewed data
Relations between three measures of centrality (mean, median, and mode) can indicate symmetric or skewed distributions of attributes.
[Figure panels: symmetric; positively skewed; negatively skewed]
28-30 Summary statistics: Range & variance
Range: The range is the difference between the max and min observed values of an attribute.
Variance: The variance $s_x^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$ and the standard deviation (STD) $s_x = \sqrt{s_x^2}$ are the most common ways to measure the spread of values. However, like the mean, they are sensitive to outliers.
Other spread measures include:
- average absolute deviation: the average of $|x_i - \bar{x}|$
- median absolute deviation: the median of $|x_i - \bar{x}|$
- interquartile range: the difference $x_{75\%} - x_{25\%}$
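The spread measures above can be sketched as follows. The data are made up, the variance uses the 1/n normalization given in the formula, deviations are taken from the mean, and the quartiles use a nearest-rank rule. The last two are assumptions, since other conventions (deviations from the median, interpolated quartiles) are also in common use.

```python
import math
import statistics


def variance(values):
    """Population variance with the 1/n normalization."""
    center = sum(values) / len(values)
    return sum((x - center) ** 2 for x in values) / len(values)


def absolute_deviations(values):
    """Absolute deviations |x_i - mean| of every observation."""
    center = sum(values) / len(values)
    return [abs(x - center) for x in values]


def nearest_rank(values, p):
    """Nearest-rank p-th percentile (assumed convention)."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(p / 100 * len(ordered))) - 1]


# Hypothetical sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(sample) - min(sample)
std = math.sqrt(variance(sample))
average_abs_dev = sum(absolute_deviations(sample)) / len(sample)
median_abs_dev = statistics.median(absolute_deviations(sample))
iqr = nearest_rank(sample, 75) - nearest_rank(sample, 25)

print(value_range, variance(sample), std, average_abs_dev, median_abs_dev, iqr)
```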
31-34 Summary statistics: Covariance & correlation
Covariance: Measures the degree to which attributes vary together and is computed by $\mathrm{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$. This value depends on the magnitude/spread of the attribute values. For k attributes, these form a $k \times k$ covariance matrix, with variances $s_x^2 = \mathrm{cov}(x, x)$ on its diagonal.
Correlation: A value between −1 and 1 that indicates how strongly two attributes are (linearly) related; its sign gives the direction of the relation and its absolute value gives the strength. Pearson correlation: $\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{s_x s_y}$. Notice that it is independent of magnitudes/spreads and that $\mathrm{corr}(x, x) = 1$.
Notice that Pearson correlation is the covariance, or dot product, between standardized attributes:
$$\mathrm{corr}(x, y) = s_x^{-1} s_y^{-1} \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$
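The equivalence between Pearson correlation and the covariance of standardized attributes can be verified numerically. This is a small sketch on made-up data, using the 1/n normalization throughout; the function names are chosen here for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """cov(x, y) with the 1/n normalization."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def std(values):
    """Standard deviation, i.e. sqrt(cov(x, x))."""
    return math.sqrt(covariance(values, values))


def correlation(xs, ys):
    """Pearson correlation: cov(x, y) / (s_x * s_y)."""
    return covariance(xs, ys) / (std(xs) * std(ys))


def standardize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    m, s = mean(values), std(values)
    return [(v - m) / s for v in values]


def correlation_via_standardization(xs, ys):
    """Mean of products of standardized values (second form of the identity)."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Hypothetical attributes.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # perfectly increasing with x
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # perfectly decreasing with x
y_noisy = [1.0, 3.0, 2.0, 5.0, 4.0]

print(covariance(x, y_up), correlation(x, y_up), correlation(x, y_down))
print(correlation(x, y_noisy), correlation_via_standardization(x, y_noisy))
```

Rescaling an attribute (for example, multiplying `y_up` by 100) changes the covariance but leaves the correlation unchanged.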
35 Summary statistics: Misleading example: pirates & global warming
[Figure] Taken from Wikipedia
36 Summary statistics: Data quality
Summary statistics enable identification of various data quality issues, such as:
Precision: The closeness of repeated measurements to one another.
Bias: A systematic variation of measurements from the quantity being measured.
Accuracy: The closeness of measurements to the true value of the quantity being measured.
Other issues include missing values, outliers, and duplicate values.
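When the true value of a measured quantity is known, precision and bias can be estimated from repeated measurements. The sketch below assumes a common convention (not stated in the slides): precision is reported as the standard deviation of the measurements and bias as the difference between their mean and the true value. The measurements are made up.

```python
import math


def bias(measurements, true_value):
    """Systematic offset: mean of the measurements minus the true value."""
    return sum(measurements) / len(measurements) - true_value


def precision(measurements):
    """Spread of repeated measurements, reported as their standard deviation."""
    center = sum(measurements) / len(measurements)
    return math.sqrt(sum((m - center) ** 2 for m in measurements) / len(measurements))


# Hypothetical repeated measurements of a quantity whose true value is 10.0.
readings = [10.2, 9.8, 10.1, 9.9, 10.5]

print(f"bias: {bias(readings, 10.0):.3f}, precision (std): {precision(readings):.3f}")
```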
37 Visualizations: Why do we need visualizations?
While summary statistics provide useful information about the data, they can be overwhelming and hard to track when many attributes are considered.
Visualization: Conversion of data into visual elements that express characteristics, relationships, and information about data points and attributes.
Visualizations provide graphic representations that enable us to draw insights at a single glance.
38-40 Visualizations: Why do we need visualizations?
Example (TreeMap) [Figures] Taken from Wikipedia
41 Visualizations: What constitutes a good visualization?
No really good answer... but there are some general guidelines, the ACCENT principles:
- Apprehension: we can correctly perceive relations among variables.
- Clarity: visually distinguish important relations and elements.
- Consistency: comparing graphical elements/displays shows faithful (dis)similarities in the data.
- Efficiency: simplify complex relations and patterns in the visualization.
- Necessity: only include necessary graphical elements, no extraneous elements.
- Truthfulness: true values (absolute or relative) can be determined from graphical elements.
42 Visualizations: Box plots
Box plots (invented by J. Tukey) show a five-number summary of the distribution of attribute values, based on percentiles.
[Figure labels: outliers; 90th percentile; 75th percentile; median; 25th percentile; 10th percentile]
43-45 Visualizations: Histograms
[Figures]
46 Visualizations: Histograms
[Figure] Taken from: Pierchala, C. The choice of age groupings may affect the quality of tabular presentations
47-48 Visualizations: Star plots
[Figures] Taken from:
49 Visualizations: Parallel coordinate plots
[Figure] Notice that attributes in this case do not have a particular order.
50 Visualizations: Scatter plots
[Figure]
51 Visualizations: Quiver plots
[Figure]
52 Non-tabular Data
53 Transactional data
In transactional data, each observation is a transaction that contains a collection of items or a sequence of events.
Example (market basket data): Customer #1: {milk, bread, butter}; Customer #2: {orange juice, milk}; Customer #3: {orange juice, peanut butter, jelly, bread}; ...
Transaction items can also contain numerical attributes, such as the number of purchased items (e.g., 3 boxes of cookies) or their price. When sequences (e.g., events, actions, or genes) are considered, temporal/order information may also be included.
54 Transactional data: Term matrix
In some cases, transactional data can be converted to tabular form by considering a term matrix (a.k.a. bag of words/features techniques).
Example (one row per customer; table values not preserved):
CustomerID | milk | bread | butter | O.J. | cheese | P.B. | jelly
This representation loses sequential information, and applying it to continuous values requires a discretization step.
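A term matrix can be built directly from a list of transactions. The sketch below reuses the three market baskets listed in the transactional-data example; the choice to store item counts (rather than 0/1 indicators) and to order the columns alphabetically is an assumption made here for illustration.

```python
from collections import Counter

# The three baskets from the market basket example.
transactions = {
    "Customer #1": ["milk", "bread", "butter"],
    "Customer #2": ["orange juice", "milk"],
    "Customer #3": ["orange juice", "peanut butter", "jelly", "bread"],
}


def term_matrix(baskets):
    """Return (vocabulary, rows): one count vector per transaction."""
    vocabulary = sorted({item for items in baskets.values() for item in items})
    rows = {}
    for customer, items in baskets.items():
        counts = Counter(items)
        rows[customer] = [counts[term] for term in vocabulary]
    return vocabulary, rows


vocabulary, rows = term_matrix(transactions)
print(vocabulary)
for customer, row in rows.items():
    print(customer, row)
```

Each row is now an ordinary tabular observation, at the cost of discarding the order in which items appeared in the transaction.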
55-57 Transactional data: Text documents
Text documents can be considered as transactional data in one of two ways:
1 Each document can be considered as a big transaction containing words. Bag of words techniques ignore grammatical structures and represent a document as a histogram of word occurrences. Similar approaches can also be applied to images, questionnaires, etc., with an appropriate dictionary-building clustering step.
2 A document can be considered as a transactional dataset on its own, which contains word contexts (e.g., with n-grams or skip-grams). Word2vec techniques use this approach to associate numerical coordinates (typically in $\mathbb{R}^{300}$) to words based on the contexts in which they appear.
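Both views of a document can be sketched in a few lines. The tokenizer below (lowercasing and keeping runs of ASCII letters) and the sample sentence are simplifying assumptions; the second function only extracts n-gram contexts and does not implement word2vec itself.

```python
import re
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase, keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """View 1: the document as a histogram of word occurrences."""
    return Counter(tokenize(text))


def ngrams(tokens, n):
    """View 2: the document as a collection of n-word contexts."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical document.
document = "The cat sat on the mat. The dog sat too."

histogram = bag_of_words(document)
bigrams = ngrams(tokenize(document), 2)

print(histogram.most_common(3))
print(bigrams[:4])
```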
58 Transactional data: Text documents
Example (term analysis of Donald Trump's tweets) [Figures: most frequent words; iPhone vs. Android] Taken from varianceexplained.org/r/trump-tweets/
59 Transactional data: Text documents
[Figure] Taken from github.com/aubry74/visual-word2vec/
60 Structured signals
Structured signals have well-known relations between their attributes. They are typically numerical, with temporal or spatial ordering.
Examples: audio recordings, EEG signals, heart rate, room temperatures.
Each data point is then a signal collected over time (or space), and it can be analyzed with signal processing tools.
61-63 Structured signals: Fourier & power spectrum
[Figures: time series; power spectrum]
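As a rough illustration of how a power spectrum is obtained from a time series, the sketch below computes a naive discrete Fourier transform with the standard library and takes the squared magnitude of each coefficient. The test signal (a cosine with 3 cycles over 16 samples) and the unnormalized definition of power are assumptions chosen for illustration; in practice an FFT routine would be used instead of this O(n²) loop.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform: X_k = sum_t x_t * exp(-2*pi*i*k*t/n)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(signal):
    """Squared magnitude of each Fourier coefficient (no normalization)."""
    return [abs(coefficient) ** 2 for coefficient in dft(signal)]


# Hypothetical time series: a pure cosine completing 3 cycles in 16 samples.
n_samples = 16
time_series = [math.cos(2 * math.pi * 3 * t / n_samples) for t in range(n_samples)]

power = power_spectrum(time_series)

# For a real signal the spectrum is symmetric, so only the first half is searched.
dominant_frequency = max(range(n_samples // 2 + 1), key=lambda k: power[k])
print(dominant_frequency)
```

The spectrum concentrates all the energy at frequency index 3 (and its mirror image), which is far easier to read than the raw samples.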
64-66 Structured signals: STFT & wavelets
[Figures: STFT; wavelets; Haar wavelets (lowpass, scale 1, scale 2)]
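A Haar decomposition can be sketched with pairwise averages and differences. The version below uses the unnormalized averaging variant (dividing by 2 instead of by the square root of 2) so that the numbers stay easy to follow; the sample signal and the convention that the first list of details is the finest scale are assumptions made for illustration.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (lowpass) and pairwise half-differences."""
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details


def haar_decompose(signal, levels):
    """Return (lowpass, details_per_scale), with the finest scale first."""
    lowpass = list(signal)
    details_per_scale = []
    for _ in range(levels):
        if len(lowpass) % 2 != 0:
            raise ValueError("signal length must be divisible by 2**levels")
        lowpass, details = haar_step(lowpass)
        details_per_scale.append(details)
    return lowpass, details_per_scale


def haar_reconstruct(lowpass, details_per_scale):
    """Invert haar_decompose: each pair is (average + detail, average - detail)."""
    signal = list(lowpass)
    for details in reversed(details_per_scale):
        signal = [v for a, d in zip(signal, details) for v in (a + d, a - d)]
    return signal


# Hypothetical signal of length 8, decomposed over two scales.
samples = [4, 6, 10, 12, 8, 6, 5, 5]
lowpass, details_per_scale = haar_decompose(samples, levels=2)

print(lowpass)            # coarse approximation
print(details_per_scale)  # details at scale 1 (finest), then scale 2
print(haar_reconstruct(lowpass, details_per_scale))
```

The lowpass part captures the slow trend, each detail list captures changes at one scale, and together they reproduce the original signal exactly.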
67-68 Structured signals: Spectrogram & scalogram
[Figures: spectrogram; scalogram]
69 Multidimensional signals
Multidimensional signals are similar, but have several coordinates that specify the relations between their attributes.
Examples:
- Grayscale images have two spatial coordinates that determine pixel positions.
- Videos have two spatial coordinates and one temporal coordinate that determine pixel positions.
- Geographic data has two or three coordinates determining longitude, latitude, and elevation.
- Colored and hyperspectral images have two spatial coordinates and one spectral/channel coordinate.
In general, many signal processing approaches can be extended from one-dimensional signals to multidimensional ones.
70-73 Multidimensional signals: Two-dimensional wavelets
[Figures]
74-75 Multidimensional signals: Visualization with contour plots
[Figures]
76 Nonparametric representations
In some cases, the important information in the data is the relations between data points, rather than their attributes.
Examples:
- Spatial locations and trajectories
- Phone calls and correspondences
- Gene interactions and cell progressions
In these cases, an affinity matrix between data points, based on similarities or distances, can be used for analysis. Essentially, each data point is represented by its relations to other data points rather than by its own attributes.
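One common way to turn distances into affinities is a Gaussian kernel, sketched below. The kernel choice, the bandwidth parameter `sigma`, and the sample points are all assumptions made for illustration; the lecture does not prescribe a particular affinity function.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Affinity matrix with entries exp(-||p - q||^2 / (2 * sigma^2))."""
    return [
        [math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2)) for q in points]
        for p in points
    ]


# Hypothetical spatial locations: two nearby points and one far away.
locations = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
affinity = gaussian_affinity(locations, sigma=1.0)

for row in affinity:
    print([round(value, 4) for value in row])
```

Nearby points receive affinities close to 1 and distant points affinities close to 0, so each row describes a data point purely through its relations to the others.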
77-82 Nonparametric representations: Graph data
Graphs can be used to formalize relations in data in two ways:
1 Relationships between attributes can form graphs (e.g., molecule data, such as benzene, C6H6). In this case each data point is a graph on its own, and this is a more complicated example of structured data.
2 The graph is considered as the dataset, and each node is a data point (e.g., social networks and web-reference data). In this case, an adjacency matrix can form an affinity matrix. Conversely, affinity matrices can form adjacency matrices, so nonparametric data is often considered as graph data.
Spectral graph methods (e.g., SVD of the graph Laplacian) can be used to associate coordinates to data points in the second case, enabling visualization with scatter plots and further analysis.
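The graph Laplacian mentioned above can be built from an adjacency (or affinity) matrix. The sketch below uses the unnormalized Laplacian L = D − A on a made-up three-node path graph; this is one of several Laplacian variants, and the spectral decomposition step itself is omitted because it needs a numerical linear algebra library.

```python
def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A, where D holds the node degrees."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    return [
        [(degrees[i] if i == j else 0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


def quadratic_form(matrix, vector):
    """Compute v^T M v; for a Laplacian this sums w_ij * (v_i - v_j)^2 over edges."""
    n = len(vector)
    return sum(vector[i] * matrix[i][j] * vector[j] for i in range(n) for j in range(n))


# Hypothetical path graph on three nodes: 0 - 1 - 2.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
laplacian = graph_laplacian(adjacency)

print(laplacian)
print(quadratic_form(laplacian, [1, 2, 4]))  # (1-2)^2 + (2-4)^2
```

Vectors that vary little across edges give a small quadratic form, which is why the low-frequency eigenvectors of the Laplacian provide coordinates that place connected nodes close together.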
83 Nonparametric representations: Visualization with matrix plots
[Figure]
84 Summary
We considered the following data and attribute types, and briefly showed how to handle, process, and visualize them:
Types of attributes: nominal, ordinal, interval, ratio.
Types of data: tabular data; transactional & text data; structured (1D, 2D, & more) data; nonparametric & graph data.
Exploratory data analysis is crucial for obtaining intelligible results, e.g., by identifying valid applicable operations on the data and possibly transforming it to a representation more amenable to analysis.
Other preprocessing steps include normalization/standardization, sampling, discretization, aggregation, and dimensionality reduction.
More informationMath 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency
Math 1 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency lowest value + highest value midrange The word average: is very ambiguous and can actually refer to the mean,
More informationSTA Module 2B Organizing Data and Comparing Distributions (Part II)
STA 2023 Module 2B Organizing Data and Comparing Distributions (Part II) Learning Objectives Upon completing this module, you should be able to 1 Explain the purpose of a measure of center 2 Obtain and
More informationSTA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II)
STA 2023 Module 2B Organizing Data and Comparing Distributions (Part II) Learning Objectives Upon completing this module, you should be able to 1 Explain the purpose of a measure of center 2 Obtain and
More informationThe basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student
Organizing data Learning Outcome 1. make an array 2. divide the array into class intervals 3. describe the characteristics of a table 4. construct a frequency distribution table 5. constructing a composite
More informationMATH 1070 Introductory Statistics Lecture notes Descriptive Statistics and Graphical Representation
MATH 1070 Introductory Statistics Lecture notes Descriptive Statistics and Graphical Representation Objectives: 1. Learn the meaning of descriptive versus inferential statistics 2. Identify bar graphs,
More informationChapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data
Chapter 2 Descriptive Statistics: Organizing, Displaying and Summarizing Data Objectives Student should be able to Organize data Tabulate data into frequency/relative frequency tables Display data graphically
More informationData Exploration and Preparation Data Mining and Text Mining (UIC Politecnico di Milano)
Data Exploration and Preparation Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining, : Concepts and Techniques", The Morgan Kaufmann
More informationTopic (3) SUMMARIZING DATA - TABLES AND GRAPHICS
Topic (3) SUMMARIZING DATA - TABLES AND GRAPHICS 3- Topic (3) SUMMARIZING DATA - TABLES AND GRAPHICS A) Frequency Distributions For Samples Defn: A FREQUENCY DISTRIBUTION is a tabular or graphical display
More informationUnit 7 Statistics. AFM Mrs. Valentine. 7.1 Samples and Surveys
Unit 7 Statistics AFM Mrs. Valentine 7.1 Samples and Surveys v Obj.: I will understand the different methods of sampling and studying data. I will be able to determine the type used in an example, and
More informationChapter 2: Descriptive Statistics
Chapter 2: Descriptive Statistics Student Learning Outcomes By the end of this chapter, you should be able to: Display data graphically and interpret graphs: stemplots, histograms and boxplots. Recognize,
More informationCHAPTER 2: SAMPLING AND DATA
CHAPTER 2: SAMPLING AND DATA This presentation is based on material and graphs from Open Stax and is copyrighted by Open Stax and Georgia Highlands College. OUTLINE 2.1 Stem-and-Leaf Graphs (Stemplots),
More informationAnalytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.
Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied
More informationSTA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures
STA 2023 Module 3 Descriptive Measures Learning Objectives Upon completing this module, you should be able to: 1. Explain the purpose of a measure of center. 2. Obtain and interpret the mean, median, and
More informationCHAPTER 3: Data Description
CHAPTER 3: Data Description You ve tabulated and made pretty pictures. Now what numbers do you use to summarize your data? Ch3: Data Description Santorico Page 68 You ll find a link on our website to a
More informationCluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1
Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods
More informationChapter 1. Looking at Data-Distribution
Chapter 1. Looking at Data-Distribution Statistics is the scientific discipline that provides methods to draw right conclusions: 1)Collecting the data 2)Describing the data 3)Drawing the conclusions Raw
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 2 Sajjad Haider Spring 2010 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric
More informationData Mining: Data. What is Data? Lecture Notes for Chapter 2. Introduction to Data Mining. Properties of Attribute Values. Types of Attributes
0 Data Mining: Data What is Data? Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Collection of data objects and their attributes An attribute is a property or characteristic
More informationData Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining
10 Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 What is Data? Collection of data objects
More informationWhat are we working with? Data Abstractions. Week 4 Lecture A IAT 814 Lyn Bartram
What are we working with? Data Abstractions Week 4 Lecture A IAT 814 Lyn Bartram Munzner s What-Why-How What are we working with? DATA abstractions, statistical methods Why are we doing it? Task abstractions
More informationMHPE 494: Data Analysis. Welcome! The Analytic Process
MHPE 494: Data Analysis Alan Schwartz, PhD Department of Medical Education Memoona Hasnain,, MD, PhD, MHPE Department of Family Medicine College of Medicine University of Illinois at Chicago Welcome! Your
More informationVisual Analytics. Visualizing multivariate data:
Visual Analytics 1 Visualizing multivariate data: High density time-series plots Scatterplot matrices Parallel coordinate plots Temporal and spectral correlation plots Box plots Wavelets Radar and /or
More informationVocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.
5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table
More informationStatistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or me, I will answer promptly.
Statistical Methods Instructor: Lingsong Zhang 1 Issues before Class Statistical Methods Lingsong Zhang Office: Math 544 Email: lingsong@purdue.edu Phone: 765-494-7913 Office Hour: Monday 1:00 pm - 2:00
More informationChapter 3 - Displaying and Summarizing Quantitative Data
Chapter 3 - Displaying and Summarizing Quantitative Data 3.1 Graphs for Quantitative Data (LABEL GRAPHS) August 25, 2014 Histogram (p. 44) - Graph that uses bars to represent different frequencies or relative
More informationSpecial Review Section. Copyright 2014 Pearson Education, Inc.
Special Review Section SRS-1--1 Special Review Section Chapter 1: The Where, Why, and How of Data Collection Chapter 2: Graphs, Charts, and Tables Describing Your Data Chapter 3: Describing Data Using
More informationOrganizing Your Data. Jenny Holcombe, PhD UT College of Medicine Nuts & Bolts Conference August 16, 3013
Organizing Your Data Jenny Holcombe, PhD UT College of Medicine Nuts & Bolts Conference August 16, 3013 Learning Objectives Identify Different Types of Variables Appropriately Naming Variables Constructing
More informationMean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242
Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242 Creation & Description of a Data Set * 4 Levels of Measurement * Nominal, ordinal, interval, ratio * Variable Types
More informationCHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and
CHAPTER-13 Mining Class Comparisons: Discrimination between DifferentClasses: 13.1 Introduction 13.2 Class Comparison Methods and Implementation 13.3 Presentation of Class Comparison Descriptions 13.4
More informationExploring and Understanding Data Using R.
Exploring and Understanding Data Using R. Loading the data into an R data frame: variable
More informationLAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA
LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA This lab will assist you in learning how to summarize and display categorical and quantitative data in StatCrunch. In particular, you will learn how to
More informationPrepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.
Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good
More informationPreprocessing Short Lecture Notes cse352. Professor Anita Wasilewska
Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept
More information1. To condense data in a single value. 2. To facilitate comparisons between data.
The main objectives 1. To condense data in a single value. 2. To facilitate comparisons between data. Measures :- Locational (positional ) average Partition values Median Quartiles Deciles Percentiles
More informationChapter Two: Descriptive Methods 1/50
Chapter Two: Descriptive Methods 1/50 2.1 Introduction 2/50 2.1 Introduction We previously said that descriptive statistics is made up of various techniques used to summarize the information contained
More informationMATH 117 Statistical Methods for Management I Chapter Two
Jubail University College MATH 117 Statistical Methods for Management I Chapter Two There are a wide variety of ways to summarize, organize, and present data: I. Tables 1. Distribution Table (Categorical
More informationDATA PREPROCESSING. Tzompanaki Katerina
DATA PREPROCESSING Tzompanaki Katerina Background: Data storage formats Data in DBMS ODBC, JDBC protocols Data in flat files Fixed-width format (each column has a specific number of characters, filled
More informationCOSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor
COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality
More informationSTP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES
STP 6 ELEMENTARY STATISTICS NOTES PART - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES Chapter covered organizing data into tables, and summarizing data with graphical displays. We will now use
More informationMeasures of Central Tendency
Measures of Central Tendency MATH 130, Elements of Statistics I J. Robert Buchanan Department of Mathematics Fall 2017 Introduction Measures of central tendency are designed to provide one number which
More informationCS378 Introduction to Data Mining. Data Exploration and Data Preprocessing. Li Xiong
CS378 Introduction to Data Mining Data Exploration and Data Preprocessing Li Xiong Data Exploration and Data Preprocessing Data and Attributes Data exploration Data pre-processing Data Mining: Concepts
More informationData Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More informationNo. of blue jelly beans No. of bags
Math 167 Ch5 Review 1 (c) Janice Epstein CHAPTER 5 EXPLORING DATA DISTRIBUTIONS A sample of jelly bean bags is chosen and the number of blue jelly beans in each bag is counted. The results are shown in
More informationGetting to Know Your Data
Chapter 2 Getting to Know Your Data 2.1 Exercises 1. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss
More informationData Mining: Concepts and Techniques
Data Mining: Concepts and Techniques Chapter 2 Original Slides: Jiawei Han and Micheline Kamber Modification: Li Xiong Data Mining: Concepts and Techniques 1 Chapter 2: Data Preprocessing Why preprocess
More informationBasic Statistical Terms and Definitions
I. Basics Basic Statistical Terms and Definitions Statistics is a collection of methods for planning experiments, and obtaining data. The data is then organized and summarized so that professionals can
More informationChapter 2: Frequency Distributions
Chapter 2: Frequency Distributions Chapter Outline 2.1 Introduction to Frequency Distributions 2.2 Frequency Distribution Tables Obtaining ΣX from a Frequency Distribution Table Proportions and Percentages
More informationMiddle Years Data Analysis Display Methods
Middle Years Data Analysis Display Methods Double Bar Graph A double bar graph is an extension of a single bar graph. Any bar graph involves categories and counts of the number of people or things (frequency)
More informationWELCOME! Lecture 3 Thommy Perlinger
Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important
More informationAP Statistics Prerequisite Packet
Types of Data Quantitative (or measurement) Data These are data that take on numerical values that actually represent a measurement such as size, weight, how many, how long, score on a test, etc. For these
More informationCHAPTER 2 DESCRIPTIVE STATISTICS
CHAPTER 2 DESCRIPTIVE STATISTICS 1. Stem-and-Leaf Graphs, Line Graphs, and Bar Graphs The distribution of data is how the data is spread or distributed over the range of the data values. This is one of
More informationChapter 2 Describing, Exploring, and Comparing Data
Slide 1 Chapter 2 Describing, Exploring, and Comparing Data Slide 2 2-1 Overview 2-2 Frequency Distributions 2-3 Visualizing Data 2-4 Measures of Center 2-5 Measures of Variation 2-6 Measures of Relative
More informationExploratory/Visual Data Analysis
Exploratory/Visual Data Analysis Intelligent Data Analysis http://www.mit.bme.hu/node/8036 9/14/2018 Budapest University of Technology and Economics Fault Tolerant Systems Research Group Budapesti Műszaki
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 1 Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks in Data Preprocessing Data Cleaning Data Integration Data
More informationLESSON 3: CENTRAL TENDENCY
LESSON 3: CENTRAL TENDENCY Outline Arithmetic mean, median and mode Ungrouped data Grouped data Percentiles, fractiles, and quartiles Ungrouped data Grouped data 1 MEAN Mean is defined as follows: Sum
More informationNuts and Bolts Research Methods Symposium
Organizing Your Data Jenny Holcombe, PhD UT College of Medicine Nuts & Bolts Conference August 16, 3013 Topics to Discuss: Types of Variables Constructing a Variable Code Book Developing Excel Spreadsheets
More informationAt the end of the chapter, you will learn to: Present data in textual form. Construct different types of table and graphs
DATA PRESENTATION At the end of the chapter, you will learn to: Present data in textual form Construct different types of table and graphs Identify the characteristics of a good table and graph Identify
More informationData Foundations. Topic Objectives. and list subcategories of each. its properties. before producing a visualization. subsetting
CS 725/825 Information Visualization Fall 2013 Data Foundations Dr. Michele C. Weigle http://www.cs.odu.edu/~mweigle/cs725-f13/ Topic Objectives! Distinguish between ordinal and nominal values and list
More informationChapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd
Chapter 3: Data Description - Part 3 Read: Sections 1 through 5 pp 92-149 Work the following text examples: Section 3.2, 3-1 through 3-17 Section 3.3, 3-22 through 3.28, 3-42 through 3.82 Section 3.4,
More informationThemes in the Texas CCRS - Mathematics
1. Compare real numbers. a. Classify numbers as natural, whole, integers, rational, irrational, real, imaginary, &/or complex. b. Use and apply the relative magnitude of real numbers by using inequality
More informationDSC 201: Data Analysis & Visualization
DSC 201: Data Analysis & Visualization Exploratory Data Analysis Dr. David Koop What is Exploratory Data Analysis? "Detective work" to summarize and explore datasets Includes: - Data acquisition and input
More informationExploratory Data Analysis
Chapter 10 Exploratory Data Analysis Definition of Exploratory Data Analysis (page 410) Definition 12.1. Exploratory data analysis (EDA) is a subfield of applied statistics that is concerned with the investigation
More information刘淇 School of Computer Science and Technology USTC
Data Exploration 刘淇 School of Computer Science and Technology USTC http://staff.ustc.edu.cn/~qiliuql/dm2013.html t t / l/dm2013 l What is data exploration? A preliminary exploration of the data to better
More informationData Mining: Exploring Data
Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar But we start with a brief discussion of the Friedman article and the relationship between Data
More informationDownloaded from
UNIT 2 WHAT IS STATISTICS? Researchers deal with a large amount of data and have to draw dependable conclusions on the basis of data collected for the purpose. Statistics help the researchers in making
More informationSections Graphical Displays and Measures of Center. Brian Habing Department of Statistics University of South Carolina.
STAT 515 Statistical Methods I Sections 2.1-2.3 Graphical Displays and Measures of Center Brian Habing Department of Statistics University of South Carolina Redistribution of these slides without permission
More informationCHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.
1 CHAPTER 1 Introduction Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. Variable: Any characteristic of a person or thing that can be expressed
More informationPredict Outcomes and Reveal Relationships in Categorical Data
PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,
More informationMean,Median, Mode Teacher Twins 2015
Mean,Median, Mode Teacher Twins 2015 Warm Up How can you change the non-statistical question below to make it a statistical question? How many pets do you have? Possible answer: What is your favorite type
More informationEx.1 constructing tables. a) find the joint relative frequency of males who have a bachelors degree.
Two-way Frequency Tables two way frequency table- a table that divides responses into categories. Joint relative frequency- the number of times a specific response is given divided by the sample. Marginal
More information2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.
Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss
More informationSpatial Outlier Detection
Spatial Outlier Detection Chang-Tien Lu Department of Computer Science Northern Virginia Center Virginia Tech Joint work with Dechang Chen, Yufeng Kou, Jiang Zhao 1 Spatial Outlier A spatial data point
More information2.1: Frequency Distributions and Their Graphs
2.1: Frequency Distributions and Their Graphs Frequency Distribution - way to display data that has many entries - table that shows classes or intervals of data entries and the number of entries in each
More informationExploring Data data exploration Exploratory Data Analysis
3 Exploring Data The previous chapter addressed high-level data issues that are important in the knowledge discovery process This chapter provides an introduction to data exploration, which is a preliminary
More information15 Wyner Statistics Fall 2013
15 Wyner Statistics Fall 2013 CHAPTER THREE: CENTRAL TENDENCY AND VARIATION Summary, Terms, and Objectives The two most important aspects of a numerical data set are its central tendencies and its variation.
More information