Data Mining: Concepts and Techniques
1 Data Mining: Concepts and Techniques Chapter 2 Original Slides: Jiawei Han and Micheline Kamber Modification: Li Xiong Data Mining: Concepts and Techniques 1
2 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary Data Mining: Concepts and Techniques 2
3 Why Data Preprocessing?
Data in the real world is dirty
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation=" "
- noisy: containing errors or outliers
  e.g., Salary="-10"
- inconsistent: containing discrepancies in codes or names
  e.g., Age="42" Birthday="03/07/1997"
  e.g., Was rating "1,2,3", now rating "A, B, C"
  e.g., discrepancy between duplicate records
4 Why Is Data Dirty? Incomplete data may come from Not applicable data value when collected Different considerations between the time when the data was collected and when it is analyzed. Human/hardware/software problems Noisy data (incorrect values) may come from Faulty data collection instruments Human or computer error at data entry Errors in data transmission Inconsistent data may come from Different data sources Functional dependency violation (e.g., modify some linked data) Duplicate records also need data cleaning January 24, 2008 Data Mining: Concepts and Techniques 4
5 Multi-Dimensional Measure of Data Quality A well-accepted multidimensional view: Accuracy Completeness Consistency Timeliness Believability Value added Interpretability Accessibility Broad categories: Intrinsic, contextual, representational, and accessibility January 24, 2008 Data Mining: Concepts and Techniques 5
6 Major Tasks in Data Preprocessing Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration Integration of multiple databases, data cubes, or files Data transformation Normalization and aggregation Data reduction Obtains reduced representation in volume but produces the same or similar analytical results Data discretization Part of data reduction but with particular importance, especially for numerical data January 24, 2008 Data Mining: Concepts and Techniques 6
7 Forms of Data Preprocessing Data Mining: Concepts and Techniques 7
8 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary January 24, 2008 Data Mining: Concepts and Techniques 8
9 Descriptive Data Summarization Motivation: to better understand the data Descriptive statistics: describe basic features of data (graphical description, tabular description, summary statistics) Descriptive data summarization Measuring central tendency: how data seem similar Measuring statistical variability or dispersion of data: how data differ Graphic display of descriptive data summarization
10 Measuring the Central Tendency
Mean (sample vs. population): \( \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \), \( \mu = \frac{\sum x}{N} \)
Weighted arithmetic mean: \( \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \)
Trimmed mean: chopping extreme values
Median: middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data): \( \text{median} = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{\text{median}}} \right) c \)
Mode: value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula: \( \text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median}) \)
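The measures on this slide can be sketched with Python's standard library. This is a minimal illustration, not part of the original slides: the helper names (`weighted_mean`, `trimmed_mean`, `modes`) and the `salaries` values are assumptions chosen for the example.

```python
from collections import Counter
from statistics import mean, median


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut per side
    return mean(ordered[k:len(ordered) - k])


def modes(values):
    """All most-frequent values (one if unimodal, two if bimodal, ...)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Illustrative data only (in thousands); 110 acts as an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(salaries), median(salaries), modes(salaries))
print(trimmed_mean(salaries, 0.1))
```

The trimmed mean drops one value from each end here, which pulls the estimate back toward the median when a single extreme value is present.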
11 Symmetric vs. Skewed Data Median, mean and mode of symmetric, positively and negatively skewed data Mean Median Mode January 24, 2008 Data Mining: Concepts and Techniques 11
12 Computational Issues Different types of measures Distributive measure: can be computed by partitioning the data into smaller subsets. E.g. sum, count Algebraic measure: can be computed by applying an algebraic function to one or more distributive measures. E.g.? Holistic measure: must be computed on the entire dataset as a whole. E.g.? Selection algorithm: finding the kth smallest number in a list E.g. min, max, median Selection by sorting: O(n log n) Linear-time algorithms based on quicksort (quickselect): O(n) on average
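As one possible answer to the "E.g.?" prompts: the mean is a standard example of an algebraic measure (it is sum divided by count, both of which can be computed per partition), while the median is a standard example of a holistic one. The sketch below shows both ideas; the function names are illustrative assumptions, and the quickselect shown is a simple list-copying variant written for clarity rather than speed.

```python
import random


def partial_sum_count(chunk):
    """Per-partition sum and count; each partition reports its own pair."""
    return sum(chunk), len(chunk)


def mean_from_partitions(chunks):
    """Algebraic measure: the mean is a function of the combined sum and count."""
    parts = [partial_sum_count(c) for c in chunks]
    total = sum(s for s, _ in parts)
    count = sum(n for _, n in parts)
    return total / count


def quickselect(values, k):
    """Return the k-th smallest value (k = 0 is the minimum).

    Expected O(n) time with a random pivot; assumes 0 <= k < len(values).
    """
    items = list(values)
    while True:
        pivot = random.choice(items)
        lower = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if k < len(lower):
            items = lower
        elif k < len(lower) + len(equal):
            return pivot
        else:
            k -= len(lower) + len(equal)
            items = [x for x in items if x > pivot]


print(mean_from_partitions([[1, 2, 3], [4, 5], [6]]))
print(quickselect([7, 1, 5, 3, 9], 2))  # median of five values
```

Unlike the mean, the median cannot be assembled from per-partition medians alone, which is why it needs a selection algorithm over the whole list.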
13 The Long Tail Long tail: low-frequency population (e.g. wealth distribution) The Long Tail: the current and future business and economic models Previous empirical studies: Amazon, Netflix Products that are in low demand or have low sales volume can collectively make up a market share that rivals or exceeds the relatively few current bestsellers and blockbusters The primary value of the internet: providing access to products in the long tail Business and social implications mass market retailers: Amazon, Netflix, ebay content producers: YouTube The Long Tail. Chris Anderson, Wired, Oct The Long Tail: Why the Future of Business is Selling Less of More. Chris Anderson January 24, 2008 Data Mining: Concepts and Techniques 13
14 Measuring the Dispersion of Data
Dispersion or variance: the degree to which numerical data tend to spread
Range and Quartiles
- Range: difference between the largest and smallest values
- Percentile: the value of a variable below which a certain percent of data fall (algebraic or holistic?)
- Quartiles: Q1 (25th percentile), Median (50th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 - Q1
- Five number summary: min, Q1, M, Q3, max (Boxplot)
- Outlier: usually, a value at least 1.5 x IQR higher/lower than Q3/Q1
Variance and standard deviation (sample: s, population: σ)
- Variance: sample vs. population (algebraic or holistic?)
  \( s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] \)
  \( \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2 \)
- Standard deviation s (or σ) is the square root of variance s^2 (or σ^2)
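A short sketch of these dispersion measures follows. Quartile conventions differ between textbooks and tools; this example assumes the "inclusive" method of `statistics.quantiles`, and the helper names and data are illustrative only. The last function uses the second form of the sample variance formula, which needs only n, the sum, and the sum of squares.

```python
from statistics import quantiles, variance


def five_number_summary(values):
    """min, Q1, median, Q3, max using the 'inclusive' quartile convention."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


def iqr_outliers(values):
    """Values lying beyond 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


def sample_variance_from_sums(values):
    """s^2 computed from n, sum(x) and sum(x^2) only."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total * total / n) / (n - 1)


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative
print(five_number_summary(salaries))
print(iqr_outliers(salaries))
print(sample_variance_from_sums(salaries), variance(salaries))
```

Because the variance can be built from sums that are themselves computable per partition, this form suggests an answer to the slide's "algebraic or holistic?" question for variance, whereas percentiles require selection over the full data.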
15 Graphic Displays of Basic Statistical Descriptions Histogram Boxplot Quantile plot Quantile-quantile (q-q) plot Scatter plot Loess (local regression) curve January 24, 2008 Data Mining: Concepts and Techniques 15
16 Histogram Analysis Graphical display of tabulated frequencies univariate graphical method (one attribute) data partitioned into disjoint buckets (typically equal-width) a set of rectangles that reflect the counts or frequencies of values at the bucket Bar chart for categorical values
17 Boxplot Analysis Visualizes five-number summary: The ends of the box are the first and third quartiles (Q1 and Q3), i.e., the height of the box is the IQR The median (M) is marked by a line within the box Whiskers: two lines outside the box extend to Minimum and Maximum
18 Example Boxplot: Profit Analysis January 24, 2008 Data Mining: Concepts and Techniques 18
19 Quantile Plot Displays all of the data for the given attribute Plots quantile information Each data point (x_i, f_i) indicates that approximately 100 f_i % of the data are below or equal to the value x_i
20 Quantile-Quantile (Q-Q) Plot Graphs the quantiles of one univariate distribution against the corresponding quantiles of another Used to diagnose differences between two probability distributions
21 Scatter plot Displays values for two numerical attributes (bivariate data) Each pair of values plotted as a point in the plane can suggest various kinds of correlations between variables with a certain confidence level: positive (rising), negative (falling), or null (uncorrelated). January 24, 2008 Data Mining: Concepts and Techniques 21
22 Example Scatter Plot Correlation between Wine Consumption and Heart Mortality US France Data Mining: Concepts and Techniques 22
23 Positively and Negatively Correlated Data Data Mining: Concepts and Techniques 23
24 Not Correlated Data Data Mining: Concepts and Techniques 24
25 Loess Curve Locally weighted scatter plot smoothing to provide better perception of the pattern of dependence Fitting simple models to localized subsets of the data January 24, 2008 Data Mining: Concepts and Techniques 25
26 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary January 24, 2008 Data Mining: Concepts and Techniques 26
27 Data Cleaning Importance Data cleaning is one of the three biggest problems in data warehousing Ralph Kimball Data cleaning is the number one problem in data warehousing DCI survey Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data Resolve redundancy caused by data integration January 24, 2008 Data Mining: Concepts and Techniques 27
28 Missing Data Data is not always available E.g., many tuples have no recorded value for several attributes, such as customer income in sales data Missing data may be due to: equipment malfunction; data inconsistent with other recorded data and thus deleted; data not entered due to misunderstanding; certain data not considered important at the time of entry; history or changes of the data not registered Missing data may need to be inferred.
29 How to Handle Missing Values? Ignore the tuple: usually done when class label is missing (assuming the tasks in Fill in the missing value manually Fill in the missing value automatically a global constant : e.g., unknown, a new class?! the attribute mean the attribute mean for all samples belonging to the same class: smarter the most probable value: inference-based such as Bayesian formula or decision tree (Chap 6) January 24, 2008 Data Mining: Concepts and Techniques 29
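One of the automatic strategies listed above, filling with the attribute mean of samples in the same class, can be sketched as follows. The record layout (a list of dicts with `None` marking a missing value) and the function name are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def fill_with_class_mean(rows, attr, label):
    """Replace None in `attr` with the mean of `attr` within the same class.

    Rows are copied, not modified in place. If a class has no known value,
    the overall attribute mean is used as a fallback.
    """
    known = defaultdict(list)
    for row in rows:
        if row[attr] is not None:
            known[row[label]].append(row[attr])
    overall = mean(v for vals in known.values() for v in vals)
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[attr] is None:
            vals = known.get(new_row[label])
            new_row[attr] = mean(vals) if vals else overall
        filled.append(new_row)
    return filled


customers = [  # hypothetical records
    {"income": 50, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 70, "risk": "low"},
    {"income": 20, "risk": "high"},
    {"income": None, "risk": "high"},
]
for row in fill_with_class_mean(customers, "income", "risk"):
    print(row)
```

Replacing the per-class lookup with the single `overall` value gives the plain attribute-mean strategy from the same list.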
30 Noisy Data Noise: random error or variance in a measured variable Incorrect attribute values may be due to faulty data collection instruments data entry problems data transmission problems technology limitation inconsistency in naming convention Other data problems that require data cleaning duplicate records incomplete data inconsistent data
31 How to Handle Noisy Data? Binning and smoothing sort data and partition into bins (equal-frequency or equal-width) then smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. Regression smooth by fitting the data into a function with regression Clustering detect and remove outliers that fall outside clusters Combined computer and human inspection detect suspicious values and check by human (e.g., deal with possible outliers) January 24, 2008 Data Mining: Concepts and Techniques 31
32 Simple Discretization Methods: Binning Equal-width (distance) partitioning Divides the range into N intervals of equal size: uniform grid if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N. The most straightforward, but outliers may dominate presentation Skewed data is not handled well Equal-depth (frequency) partitioning Divides the range into N intervals, each containing approximately the same number of samples Good data scaling Managing categorical attributes can be tricky
33 Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
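The price example above can be reproduced with a few lines of Python. This is a sketch under two assumptions: the number of values divides evenly into the number of bins, and bin means are rounded to the nearest integer, as the slide's figures appear to be. The function names are illustrative.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins consecutive bins of equal size.

    Assumes len(sorted_values) is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if x - lo <= hi - x else hi for x in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Ties in `smooth_by_boundaries` go to the lower boundary here; that tie-breaking rule is a choice of this sketch, since the slide does not specify one.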
34 Regression [Figure: scatter of data points with a fitted regression line y = x + 1; axes x and y, with X1 and Y1 marked]
35 Cluster Analysis January 24, 2008 Data Mining: Concepts and Techniques 35
36 Chapter 2: Data Preprocessing Why preprocess the data? Data cleaning Data integration Data transformation Data reduction Discretization and concept hierarchy generation Summary January 24, 2008 Data Mining: Concepts and Techniques 36
37 Data Integration Data integration: combines data from multiple sources into a unified view Architectures Data warehouse (tightly coupled) Federated database systems (loosely coupled) Database heterogeneity Semantic integration January 24, 2008 Data Mining: Concepts and Techniques 37
38 Data Warehouse Approach [Figure: Clients (Query & Analysis) access a Warehouse with Metadata; ETL loads the Warehouse from multiple Sources]
39 Advantages and Disadvantages of Data Warehouse Advantages: High query performance; Can operate when sources unavailable; Extra information at warehouse (modification, summarization (aggregates), historical information); Local processing at sources unaffected Disadvantages: Data freshness; Difficult to construct when only having access to query interface of local sources
40 Federated Database Systems [Figure: Clients query a Mediator, which reaches each Source through its own Wrapper]
41 Advantages and Disadvantages of Federated Database Systems Advantage No need to copy and store data at mediator More up-to-date data Only query interface needed at sources Disadvantage Query performance Source availability
42 Database Heterogeneity System Heterogeneity: use of different operating system, hardware platforms Schematic or Structural Heterogeneity: the native model or structure to store data differ in data sources. Syntactic Heterogeneity: differences in representation format of data Semantic Heterogeneity: differences in interpretation of the 'meaning' of data
43 Semantic Integration Problem: reconciling semantic heterogeneity Levels Schema matching (schema mapping) e.g., A.cust-id ≡ B.cust-# Data matching (data deduplication, record linkage, entity/object matching) e.g., Bill Clinton = William Clinton Challenges Semantics inferred from few information sources (data creators, documentation) -> rely on schema and data Schema and data unreliable and incomplete Global pair-wise matching computationally expensive In practice, 60-80% of resources are spent on reconciling semantic heterogeneity in data sharing projects
44 Schema Matching Techniques Rule based Learning based Type of matches 1-1 matches vs. complex matches (e.g. list-price = price *(1+tax_rate)) Information used Schema information: element names, data types, structures, number of sub-elements, integrity constraints Data information: value distributions, frequency of words External evidence: past matches, corpora of schemas Ontologies. E.g. Gene Ontology Multi-matcher architecture
45 Data Matching Or? record linkage data matching object identification entity resolution entity disambiguation duplicate detection record matching instance identification deduplication reference reconciliation database hardening Data Mining: Concepts and Techniques 45
46 Data Matching Techniques Rule based Probabilistic Record Linkage (Fellegi and Sunter, 1969) Similarity between pairs of attributes Combined scores representing probability of matching Threshold based decision Machine learning approaches New challenges Complex information spaces Multiple classes Data Mining: Concepts and Techniques 46
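The "similarity between pairs of attributes, combined score, threshold" pattern listed above can be illustrated with a small sketch. This is not the Fellegi-Sunter model itself: it swaps in a plain weighted average of string similarities from Python's `difflib`, and the field names, weights, and threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] from difflib (1.0 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b, weights):
    """Weighted average of per-attribute similarities."""
    total = sum(weights.values())
    return sum(w * similarity(rec_a[f], rec_b[f])
               for f, w in weights.items()) / total


def is_match(rec_a, rec_b, weights, threshold=0.8):
    """Threshold-based decision on the combined score."""
    return match_score(rec_a, rec_b, weights) >= threshold


weights = {"name": 2, "city": 1}  # hypothetical attribute weights
a = {"name": "Bill Clinton", "city": "Little Rock"}
b = {"name": "William Clinton", "city": "Little Rock"}
c = {"name": "George Bush", "city": "Houston"}
print(match_score(a, b, weights), match_score(a, c, weights))
```

A probabilistic approach would replace the fixed weights with values estimated from how often each attribute agrees among true matches versus non-matches.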
47 Chapter 2: Data Preprocessing Why preprocess the data? Data cleaning Data integration Data transformation Data reduction Discretization and concept hierarchy generation Summary January 24, 2008 Data Mining: Concepts and Techniques 47
48 Data Transformation Smoothing: remove noise from data (data cleaning) Aggregation: summarization E.g. Daily sales -> monthly sales Discretization and generalization E.g. age -> youth, middle-aged, senior (Statistical) Normalization: scaled to fall within a small, specified range E.g. income vs. age Attribute construction: construct new attributes from given ones E.g. birthday -> age January 24, 2008 Data Mining: Concepts and Techniques 48
49 Data Aggregation Data cubes store multidimensional aggregated information Multiple levels of aggregation for analysis at multiple granularities More on data warehouse and cube computation (chap 3, 4) January 24, 2008 Data Mining: Concepts and Techniques 49
50 Normalization
Min-max normalization: [min_A, max_A] to [new_min_A, new_max_A]
\[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
Z-score normalization (μ: mean, σ: standard deviation):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
Ex. Let μ = 54,000, σ = 16,000. Then
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
Normalization by decimal scaling:
\[ v' = \frac{v}{10^j} \]
where j is the smallest integer such that \( \max(|v'|) < 1 \)
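The three normalization schemes can be checked numerically with the slide's own figures (income of 73,600 in the range 12,000 to 98,000; mean 54,000 and standard deviation 16,000). The function names are illustrative, and the decimal-scaling sketch assumes j is non-negative.

```python
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Map v from [lo, hi] linearly onto [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo


def z_score(v, mu, sigma):
    """Number of standard deviations that v lies from the mean."""
    return (v - mu) / sigma


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer >= 0 giving max(|v'|) < 1."""
    peak = max(abs(v) for v in values)
    j = 0
    while peak / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max(73_600, 12_000, 98_000), 3))
print(z_score(73_600, 54_000, 16_000))
print(decimal_scaling([-986, 917]))
```

Min-max normalization depends on the observed extremes, so a single outlier can compress all other values into a narrow band; z-score normalization is the usual alternative when the true minimum and maximum are unknown.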
51 Chapter 2: Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary January 24, 2008 Data Mining: Concepts and Techniques 51
52 Data Reduction Why data reduction? A database/data warehouse may store terabytes of data Complex data analysis/mining may take a very long time to run on the complete data set Data reduction Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results Data reduction strategies Dimensionality reduction Feature selection: attribute subset selection Feature extraction: mapping data to a smaller number of features Instance reduction
53 Feature Selection Select a set of attributes (features) such that the resulting probability distribution is as close as possible to the original distribution given all features Benefits Remove irrelevant or redundant attributes reduce # of attributes in the patterns Heuristic methods (# of choices?): Step-wise forward selection Step-wise backward elimination Combining forward selection and backward elimination Decision-tree induction (Chap 6. Classification) January 24, 2008 Data Mining: Concepts and Techniques 53
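On the "# of choices?" prompt: d attributes have 2^d possible subsets, which is why greedy heuristics are used instead of exhaustive search. Step-wise forward selection can be sketched as below. The `score` callback is an assumption of this example (in practice it might be cross-validated accuracy of a model trained on the subset), and the toy scoring function is invented purely to make the sketch runnable.

```python
def forward_selection(features, score, max_features=None):
    """Greedy step-wise forward selection.

    `score(subset)` is caller-supplied; higher means better. At each step
    the single feature that most improves the score is added, and the
    search stops when no candidate improves on the current best.
    """
    selected = []
    best = score(selected)
    remaining = list(features)
    limit = len(remaining) if max_features is None else max_features
    while remaining and len(selected) < limit:
        gains = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(gains, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no remaining feature helps
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected


# Toy score: made-up usefulness per attribute minus a cost per attribute.
usefulness = {"A1": 3, "A4": 5, "A6": 2}


def toy_score(subset):
    return sum(usefulness.get(f, 0) for f in subset) - 0.5 * len(subset)


print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score))
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.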
54 Example of Decision Tree Induction Initial attribute set: {A1, A2, A3, A4, A5, A6} [Figure: decision tree testing A4, A1 and A6, with leaves labeled Class 1 and Class 2] Reduced attribute set: {A1, A4, A6}
55 Feature Extraction Create new features (attributes) by combining/mapping existing ones Methods Principal Component Analysis Data compression methods Discrete Wavelet Transform Regression analysis
56 Principal Component Analysis (PCA) Principal component analysis: find the dimensions that capture the most variance A linear mapping of the data to a new coordinate system such that the greatest variance lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on. Steps Normalize input data: each attribute falls within the same range Compute k orthonormal (unit) vectors, i.e., principal components - each input data (vector) is a linear combination of the k principal component vectors The principal components are sorted in order of decreasing significance Weak components can be eliminated, i.e., those with low variance
57 Illustration of Principal Component Analysis [Figure: data plotted on original axes X1 and X2, with principal component axes Y1 and Y2 overlaid]
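To make the idea of "the direction of greatest variance" concrete, the sketch below finds only the first principal component, using power iteration on the sample covariance matrix in pure Python. Power iteration is a technique substituted here for brevity; it is not named in the slides, and library routines normally compute all components via an eigen- or singular value decomposition. The function name and data are illustrative.

```python
import math


def first_principal_component(rows, iterations=200):
    """Unit vector along the direction of greatest variance.

    `rows` is a list of equal-length numeric lists. Data are mean-centered
    here; scaling attributes to a common range is left to the caller.
    Assumes the start vector (all ones) is not orthogonal to the answer.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


points = [[1, 2], [2, 4], [3, 6], [4, 8]]  # all points lie on y = 2x
print(first_principal_component(points))
```

For points lying exactly on a line, the first component points along that line and the second carries zero variance, so it could be dropped without losing information.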
More informationMeasures of Central Tendency
Page of 6 Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean The sum of all data values divided by the number of
More informationIAT 355 Visual Analytics. Data and Statistical Models. Lyn Bartram
IAT 355 Visual Analytics Data and Statistical Models Lyn Bartram Exploring data Example: US Census People # of people in group Year # 1850 2000 (every decade) Age # 0 90+ Sex (Gender) # Male, female Marital