UNIT 4 Characterization and Comparison

Size: px
Start display at page:

Download "UNIT 4 Characterization and Comparison"

Transcription

1 UNIT 4 Characterization and Comparison Lecture Topic ************************************************* Lecture 22 Lecture 23 Lecture 24 Lecture 25 Lecture 26 What is concept description? Data generalization and summarization based characterization Analytical characterization: Analysis of attribute relevance Mining class comparisons: Discriminating between different classes Mining descriptive statistical measures in large databases 1

2 Lecture 22 What is Concept Description? 2

3 What is Concept Description? Descriptive vs. predictive data mining Descriptive mining: describes concepts or task relevant data sets in concise, summarative, informative, discriminative forms Predictivemining: Basedon data and analysis, constructs models for the database, and predicts the trend and properties of unknown data Concept description: Characterization: provides a concise and succinct summarization ation of the given collection of data Comparison: provides descriptions comparing two or more collections of data 3 Lecture 22 What is Concept Description?

4 Concept Description vs. OLAP Concept description: can handle complex data types of the attributes and their aggregations a more automated t process OLAP: restricted to a small number of dimension and measure types user controlled process 4 Lecture 22 What is Concept Description?

5 Lecture 23 Data generalization and summarizationbased characterization 5

6 Data Generalization and Summarization based Characterization Dt Data generalization A process which abstracts a large set of task relevant data in a database from a lowconceptual levels to higher ones Approaches: Data cube approach(olap approach) Attribute oriented induction approach 4 5 Conceptual levels Lecture 23 Data generalization and summarization based characterization 6

7 Characterization: Data Cube Approach Perform computations and store results in data cubes Strength An efficient implementation of data generalization Computation of various kinds of measures count( ), sum( ), average( g(), max( ) Generalization and specialization can be performed on a data cube by roll up and drill down Limitationsit ti handle only dimensions of simple nonnumeric data and measures of simple aggregated numeric values. Lack of intelligent analysis, can t tell which dimensions should be used and what levels should the generalization reach 7 Lecture 23 Data generalization and summarization based characterization

8 Attribute Oriented Induction Proposed din 1989 (KDD 89 workshop) kh Not confined to categorical data nor particular measures. How it is done? Collect the task relevant data( initial relation) using a relational database query Perform generalization by attribute removal or attribute generalization. Apply aggregation by merging identical, generalized tuples and accumulating their respective counts. Interactive presentation with users. 8 Lecture 23 Data generalization and summarization based characterization

9 Data focusing Basic Principles of Attribute Oriented Induction task relevant data, including dimensions, andtheresult is theinitial relation. Attribute removal remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A s higher h level lconcepts are expressed in terms of other attributes. t 9 Lecture 23 Data generalization and summarization based characterization

10 Basic Principles of Attribute Oriented Induction Attribute generalization If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A. Attribute threshold h control Generalized relation threshold control control the final relation/rule size. 10 Lecture 23 Data generalization and summarization based characterization

11 Basic Algorithm for Attribute Oriented e tedinduction Initial Relation Query processing of task relevant data, deriving the initial relation. Pre Generalization Basedonthe analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize? 11 Lecture 23 Data generalization and summarization based characterization

12 Basic Algorithm for Attribute Oriented Induction Prime Generalization Based on the PreGen plan, perform generalization to the right level to derive a prime generalized relation, accumulating the counts. Presentation User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations. 12 Lecture 23 Data generalization and summarization based characterization

13 13 Example DMQL: Describe general characteristics of graduate students in the Big University database use Big_University_DB mine characteristics as Science_Students in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa from student where status in graduate Corresponding SQL statement: Select name, gender, major, birth_place, birth_date, residence, phone#, gpa from student where status in { Msc, MBA, PhD } Lecture 23 Data generalization and summarization based characterization

14 Class Characterization: An Example Initial Relation Name Gender Major Birth-Place Birth_date Residence Phone # GPA Jim M CS Vancouver,BC, Main St., Woodman Canada Richmond Scott M CS Montreal, Que, st Ave., Lachance Canada Richmond Laura Lee F Physics Seattle, WA, USA Austin Ave., Burnaby Removed Retained Sci,Eng, Bus Country Age range City Removed Excl, VG,.. Gender Major Birth_region Age_range Residence GPA Count i M Science Canada Richmond Very-good 16 F Science Foreign Burnaby Excellent 22 Prime Generalized Relation 14 Gender Birth_Region Canada Foreign Total M F Total Lecture 23 Data generalization and summarization based characterization

15 Generalized relation Presentation of Generalized Results Relations where some or all attributes are generalized, with counts or other aggregation values accumulated. ltd Cross tabulation Mapping results into cross tabulation form (similar to contingency tables). Visualization techniques: Pie charts, bar charts, curves, cubes, and other visual forms. Quantitative characteristic rules Mapping generalized result into characteristic rules with quantitative information associated withit it, 15 grad ( x) male( x) birth _ region( x) " Canada"[ t:53%] birth _ region( x) " foreign"[ t:47%]. Lecture 23 Data generalization and summarization based characterization

16 Presentation Generalized Relation 16 Lecture 23 Data generalization and summarization based characterization

17 Presentation Crosstab 17 Lecture 23 Data generalization and summarization based characterization

18 Implementation by Cube Technology Construct a data cube on the fly for the given data mining query Facilitate efficient drill down analysis May increase the response time A balanced solution: precomputation of subprime relation Use a predefined & precomputed data cube Construct a data cube beforehand Facilitate not only the attribute oriented induction, but also attribute relevance analysis, dicing, slicing, roll up and drill down Cost of cube computation and the nontrivial storage overhead 18 Lecture 23 Data generalization and summarization based characterization

19 Lecture 24 Analytical characterization: Analysis of attribute relevance 19

20 Similarity: Characterization vs. OLAP Presentation ti of dt data summarization at multiple l levels l of abstraction. Interactivedrilling drilling, pivoting, slicing and dicing. Differences: Automated desired level allocation. Dimension relevance analysis and ranking when there are many relevant dimensions. Sophisticated typing on dimensions and measures. Analytical characterization: data dispersion analysis. 20 Lecture 24 Analytical characterization: Analysis of attribute relevance

21 Why? Attribute Relevance Analysis Which dimensions should be included? How high level of generalization? Automatic vs. interactive Reduce # attributes; easy to understand patterns What? statistical method for preprocessing data filter out irrelevant or weakly relevant attributes retain orrankrank therelevant attributes relevance related to dimensions and levels analytical characterization, analytical comparison 21 Lecture 24 Analytical characterization: Analysis of attribute relevance

22 Attribute relevance analysis Data Collection Analytical Generalization Use information gain analysis to identify highly relevant dimensions and levels. Relevance Analysis Sort and select the most relevant dimensions and levels. Attribute oriented Induction for class description Onselected dimension/level OLAP operations (drilling, slicing) on relevance rules 22 Lecture 24 Analytical characterization: Analysis of attribute relevance

23 Relevance Measures Quantitative relevance measure determines the classifying i power of an attribute within ihi a set of data. Methods information gain (ID3) gain ratio (C4.5) gini index 2 contingency table statistics uncertainty coefficient 23 Lecture 24 Analytical characterization: Analysis of attribute relevance

24 Decision tree Information Theoretic Approach each internal node tests an attribute each branch corresponds to attribute value each leaf node assigns a classification ID3 algorithm build decision tree based on training objects with known class labels to classify testing objects rank attributes with information gain measure minimal height the least number of tests to classify an object 24 Lecture 24 Analytical characterization: Analysis of attribute relevance

25 Top Down Induction of Decision Tree Attributes = {Outlook, Temperature, Humidity, Wind} PlayTennis = {yes, no} Outlook sunny overcast rain high Humidity normal yes strong Wind weak no yes no yes 25 Lecture 24 Analytical characterization: Analysis of attribute relevance

26 Entropy and Information Gain S contains s i tuples of class C i for i = {1,, m} Informationmeasuresmeasures info required to classifyany any arbitrary tuple m si si Entropy of attribute I( s1,s2,...,s A mwith ) values log 2 {a 1,a 2,,a v } i 1 s s v s... s 1j mj E(A) I( s1 j,...,s mj) Informationgained j 1 by s branching on attribute A Gain(A) I(s 1,s 2,..., sm) E(A) 26 Lecture 24 Analytical characterization: Analysis of attribute relevance

27 Task Example: Analytical Characterization Mine general characteristics describing graduate students using analytical characterization Given attributes name, gender, major, birth_place, birth_date, phone#, and gpa Gen(a i ) = concept hierarchies on a i U i = attribute analytical thresholds for a i T i = attribute generalization thresholds for a i R = attribute relevance threshold 27 Lecture 24 Analytical characterization: Analysis of attribute relevance

28 Example: Analytical Characterization 1. Data collection target class: graduate student contrasting class: undergraduate student 2. Analytical generalization using U i attribute removal remove name and phone# attribute generalization generalize major, birth_place, birth_date andgpa accumulate counts candidate relation: gender, major, birth_country, age_range and gpa 28 Lecture 24 Analytical characterization: Analysis of attribute relevance

29 Example: Analytical characterization gender major birth_country age_range gpa count M Science Canada Very_good 16 F Science Foreign Excellent 22 M Engineering Foreign Excellent 18 F Science Foreign Excellent 25 M Si Science Canada Excellent 21 F Engineering Canada Excellent 18 Candidate relation for Target class: Graduate students ( =120) gender major birth_country age_range gpa count M Science Foreign <20 Very_good 18 F Business Canada <20 Fair 20 M Business Canada <20 Fair 22 F Science Canada Fair 24 M Engineering Foreign Very_good 22 F Engineering Canada <20 Excellent Candidate relation for Contrasting class: Undergraduate students ( =130) Lecture 24 Analytical characterization: Analysis of attribute relevance

30 Example: Analytical characterization 3. Relevance analysis Calculate expected info required to classify an arbitrary tuple I(s ) I( 120, 130 ) log 2 log ,s 2. Calculate entropy of each attribute: e.g. major For major= Science : S 11 =84 S 21 =42 I(s 11,s 21 )= For major= Engineering : S 12 =36 S 22 =46 I(s 12,s 22 )= For major= Business : S 13 =0 S 23 =42 I(s 13,s 23 )=0 Number of grad students t in Science Number of undergrad students in Science 30 Lecture 24 Analytical characterization: Analysis of attribute relevance

31 Example: Analytical Characterization Calculate expected info required to classify a given sample if S is partitioned according to the attribute E(major) I( s Gain(major) 82 ) I( s ) I( s ,s21 12,s22 13,s23. I(s 1,s 2 ) E(major) ) Calculate information gain for each attribute Information gain for all attributes Gain(gender) = Gain(birth_country) = Gain(major) = Gain(gpa) = Gain(age_range) = Lecture 24 Analytical characterization: Analysis of attribute relevance

32 Example: Analytical characterization 4. Initial working relation (W 0 0) derivation R = 0.1 remove irrelevant/weakly relevant attributes from candidate relation => drop gender,, birth_country remove contrasting class candidate relation major age_range gpa count Science Very_good 16 Science Excellent 47 Science Excellent 21 Engineering Excellent 18 Engineering Excellent 18 Initial target class working relation W 0 : Graduate students 5. Perform attribute oriented induction on W 0 using T i 32 Lecture 24 Analytical characterization: Analysis of attribute relevance

33 Lecture 25 Mining class comparisons: Discriminating between different classes 33

34 Comparison Mining Class Comparisons Comparing two or more classes. Method Partition the set of relevant data into the target class and the contrasting classes Generalize both classes to the same high level concepts Compare tuples with the same high level descriptions Lecture 25 Mining class comparisons: Discriminating between different classes 34

35 Mining Class Comparisons Present for every tuple its description and two measures: support distribution within single class comparison distribution between classes Highlight the tuples with strong discriminant features Relevance Analysis Find attributes (features) which best distinguish different classes. Lecture 25 Mining class comparisons: Discriminating between different classes 35

36 Example: Analytical comparison Task Comparegraduate and undergraduate students using discriminant rule. DMQL query use Big_University_DB mine comparison as grad_vs_undergrad_students in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa for graduate_students where status in graduate versus undergraduate_students where status in undergraduate analyze count% from student t Lecture 25 Mining class comparisons: Discriminating between different classes 36

37 Given Example: Analytical comparison attributes name, gender, major, birth_place, birth_date, residence, phone# and gpa Gen(a i ) = concept hierarchies on attributes a i U i = attribute analytical thresholds for attributes a i T i = attribute generalization thresholds for attributes a i R = attribute relevance threshold Lecture 25 Mining class comparisons: Discriminating between different classes 37

38 Example: Analytical comparison 1. Data collection target and contrasting classes 2. Attribute relevance analysis remove attributes name, gender, major, phone# 3. Synchronous generalization controlled by user specified dimension thresholds prime target and contrasting classes relations/cuboids Lecture 25 Mining class comparisons: Discriminating between different classes 38

39 Example: Analytical comparison Birth_country Age_range Gpa Count% Canada Good 5.53% Canada Good 2.32% Canada Over_30 Very_good 5.86% Other Over_30 Excellent 4.68% Prime generalized relation for the target class: Graduate students Birth_country Age_range Gpa Count% Canada Fair 5.53% Canada Good 4.53% Canada Good 5.02% Other Over_30 Excellent 068% 0.68% Prime generalized relation for the contrasting class: Undergraduate students Lecture 25 Mining class comparisons: Discriminating between different classes 39

40 Example: Analytical comparison 4. Drill down, roll up and other OLAP operations on target and contrasting classes to adjust levels of abstractions of resulting description 5. Presentation as generalized relations, crosstabs, bar charts, pie charts, or rules contrasting measures to reflect comparison between target and contrasting classes count% Lecture 25 Mining class comparisons: Discriminating between different classes 40

41 Quantitative Discriminant Rules Cj = target class q a = a generalized tuple covers some tuples of class but can also cover some tuples of contrasting class d weight range: [0, 1] d weight ih count(q a C j ) quantitative discriminant i i rule form m i 1 count(q a C i ) X, target_class(x) condition(x) [d : d_weight] Lecture 25 Mining class comparisons: Discriminating between different classes 41

42 Example: Quantitative Discriminant Rule Status Birth_country Age_range Gpa Count Graduate Canada Good 90 Undergraduate Canada Good 210 Count distribution ib i between graduate and undergraduate d students for a generalized tuple Quantitative discriminant rule X, graduate_ student( X) birth_ country ( X ) " Canada" age_ range ( X ) "25 30" gpa ( X ) " good" where 90/(90+120) = 30% [ d :30%] Lecture 25 Mining class comparisons: Discriminating between different classes 42

43 Class Description Quantitative characteristic rule X, target_class(x) necessary condition(x) Quantitative discriminant rule X, target_class(x) condition(x) sufficient i Quantitative description rule X, condition target_cla ss(x) necessary and sufficient [t : t_weight] [d : d_weight] (X)[t : w 1,d : w 1]... condition n(x) [t : w n, d : w n] 1 Lecture 25 Mining class comparisons: Discriminating between different classes 43

44 Example: Quantitative Description Rule Location/item TV Computer Both_items Count t-wt d-wt Count t-wt d-wt Count t-wt d-wt Europe 80 25% 40% % 30% % 32% N_Am % 60% % 70% % 68% Both_ regions % 100% % 100% % 100% Crosstab showing associated t-weight, d-weight values and total number (in thousands) of TVs and computers sold at AllElectronics in 1998 Quantitative a description rule for target class Europe uope X,Europe(X) (item(x) " TV" ) [t : 25%,d : 40%] (item(x) " computer" ) [t : 75%,d :30%] Lecture 25 Mining class comparisons: Discriminating between different classes 44

45 Lecture 26 Mining descriptive statistical measures in large databases 45

46 Mining Data Dispersion Characteristics Motivation To better understand the data: central tendency, variation and spread Data dispersion characteristics median, max, min, quantiles, outliers, variance, etc. Numerical dimensions correspond to sorted intervals Data dispersion: analyzed with multiple granularities of precision Boxplot or quantile analysis on sorted intervals Dispersion analysis on computed measures Folding measures into numerical dimensions Boxplot or quantile analysis on the transformed cube 46 Lecture 26 Mining descriptive statistical measures in large databases

47 Measuring the Central Tendency 1 Mean x n n i 1 x i Weighted arithmetic mean Median: A holistic measure x n i 1 n i 1 w i w x i i Middle value if odd number of values, or average of the middle two values otherwise estimated by interpolation Mode Value that occurs most frequently in the data n / 2 ( f ) l median L1 ( ) c f median Unimodal, bimodal, trimodal Empirical formula: mean mode 3 ( mean median ) 47 Lecture 26 Mining descriptive statistical measures in large databases

48 Measuring the Dispersion of Data Quartiles, outliers and boxplots Quartiles: Q th th 1 (25 percentile), Q 3 (75 percentile) Inter quartile range: IQR = Q 3 Q 1 Five number summary: min, Q 1, M, Q 3, max Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot outlier individually Outlier: usually, a value higher/lower than1 1.5 x IQR Variance and standard deviation Variance s 2 : (algebraic, scalable computation) s 2 n 1 1 n i 1 ( x i x ) 2 n n 2 x i ( i 1 n i 1 Standard deviation s is the square root of variance s 2 n 1 1 [ 1 x i ) 2 ] 48 Lecture 26 Mining descriptive statistical measures in large databases

49 Boxplot Analysis Five number summaryof a distribution: Boxplot Minimum, Q1, M, Q3, Maximum Data is represented with a box The ends of the box are at the first and third quartiles, i.e., the height of the box is IRQ The median is marked by a line within the box Whiskers: two lines outside the box extend to Minimum i and Maximum 49 Lecture 26 Mining descriptive statistical measures in large databases

50 A Boxplot A boxplot 50 Lecture 26 Mining descriptive statistical measures in large databases

51 Visualization of Data Dispersion: Boxplot Analysis 51 Lecture 26 Mining descriptive statistical measures in large databases

52 Mining Descriptive Statistical Measures in Large Databases Variance s 2 1 n 1 i i n 1 i 1 n 1 n ( x x) x x i 2 Standarddeviation: deviation: the square root of the variance Measures spread about the mean It is zero if and only if all the values are equal Both the deviation and the variance are algebraic 52 Lecture 26 Mining descriptive statistical measures in large databases

53 Histogram Analysis Graph displays of basic statistical class descriptions Frequency histograms A univariate graphical method Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data 53 Lecture 26 Mining descriptive statistical measures in large databases

54 Quantile Plot Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences) Plots quantile information For a data x i data sorted in increasing order, f i indicates that approximately 100 f i % of the data are below or equal to the value x i 54 Lecture 26 Mining descriptive statistical measures in large databases

55 Quantile Quantile (Q Q) Plot Graphs the quantiles of one univariate distribution against the corresponding quantiles of another Allows the user to view whether h there is a shift in going from one distribution to another Lecture 26 Mining descriptive statistical measures in large databases 55

56 Scatter plot Provides a first look at bivariate data to see clusters of points, outliers, etc Each pair of values is treated as a pair of coordinates and plotted as points in the plane 56 Lecture 26 Mining descriptive statistical measures in large databases

57 Loess Curve Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression 57 Lecture 26 Mining descriptive statistical measures in large databases

58 Histogram Boxplot Graphic Displays of Basic Statistical Descriptions Quantile plot: each value x i is paired with f i indicating that approximately 100 f i % of data are x i Quantile quantile (q q) plot: graphs the quantiles of one univariant distribution against the corresponding quantiles of another Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane Loess (local regression) curve: add a smoothcurve to a scatter plot to provide better perception of the pattern of dependence 58 Lecture 26 Mining descriptive statistical measures in large databases

59 AO Induction vs. Learning from example Paradigm Difference inphilosophies and basic assumptions Positive and negative samples in learning from example: positive used for generalization, negative for specialization Positive ii samples only in data mining: i hence generalization based, i to drill down backtrack the generalization to a previous state Difference in methods of generalizations Machine learning generalizes on a tuple by tuple basis Data mining generalizes on an attribute by attribute basis 59 Lecture 26 Mining descriptive statistical measures in large databases

60 Comparison of Entire vs. Factored Version Space 60 Lecture 26 Mining descriptive statistical measures in large databases

61 Incremental and Parallel Mining of Concept tdescription Incremental mining: revision based on newly added data DB Generalize DB to the same level of abstraction in the generalized relation R to derive R Union R U R, i.e., merge counts and other statistical information to produce a new relation R Similar philosophy can be applied to data sampling, parallel and/or distributed mining, etc. 61 Lecture 26 Mining descriptive statistical measures in large databases

Data Mining By IK Unit 4. Unit 4

Data Mining By IK Unit 4. Unit 4 Unit 4 Data mining can be classified into two categories 1) Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms 2) Predictive mining:

More information

CS590D: Data Mining Chris Clifton

CS590D: Data Mining Chris Clifton CS590D: Data Mining Chris Clifton February 24, 2005 Concept Description Concept Description: Characterization and Comparison What is concept description? Data generalization and summarization-based characterization

More information

CS490D: Introduction to Data Mining Prof. Chris Clifton

CS490D: Introduction to Data Mining Prof. Chris Clifton CS490D: Introduction to Data Mining Prof. Chris Clifton March 12, 2004 Data Mining Process How to Choose a Data Mining System? Commercial data mining systems have little in common Different data mining

More information

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking

More information

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and CHAPTER-13 Mining Class Comparisons: Discrimination between DifferentClasses: 13.1 Introduction 13.2 Class Comparison Methods and Implementation 13.3 Presentation of Class Comparison Descriptions 13.4

More information

Data Mining and Analytics. Introduction

Data Mining and Analytics. Introduction Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Data Mining Primitives, Languages, and System Data Mining Primitives Task-relevant data The kinds of knowledge to be mined: Background Knowledge

Data Mining Primitives, Languages, and System Data Mining Primitives Task-relevant data The kinds of knowledge to be mined: Background Knowledge Data Mining Primitives, Languages, and System Data Mining Primitives Task-relevant data The kinds of knowledge to be mined: Background Knowledge Interestingness measures Presentation and visualization

More information

Getting to Know Your Data

Getting to Know Your Data Chapter 2 Getting to Know Your Data 2.1 Exercises 1. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

Road Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary

Road Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary 2. Data preprocessing Road Map Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary 2 Data types Categorical vs. Numerical Scale types

More information

Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Mining: Concepts and Techniques Chapter 2 Original Slides: Jiawei Han and Micheline Kamber Modification: Li Xiong Data Mining: Concepts and Techniques 1 Chapter 2: Data Preprocessing Why preprocess

More information

Data Warehousing 資料倉儲

Data Warehousing 資料倉儲 Data Warehousing 資料倉儲 Data Cube Computation and Data Generation 992DW05 MI4 Tue. 8,9 (15:10-17:00) L413 Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information Management, Tamkang University 淡江大學資訊管理學系

More information

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 Objectives 2.1 What Are the Types of Data? www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative

More information

Data Preprocessing. Chapter Why Preprocess the Data?

Data Preprocessing. Chapter Why Preprocess the Data? Contents 2 Data Preprocessing 3 2.1 Why Preprocess the Data?........................................ 3 2.2 Descriptive Data Summarization..................................... 6 2.2.1 Measuring the Central

More information

AND NUMERICAL SUMMARIES. Chapter 2

AND NUMERICAL SUMMARIES. Chapter 2 EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 What Are the Types of Data? 2.1 Objectives www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative

More information

3. Data Analysis and Statistics

3. Data Analysis and Statistics 3. Data Analysis and Statistics 3.1 Visual Analysis of Data 3.2.1 Basic Statistics Examples 3.2.2 Basic Statistical Theory 3.3 Normal Distributions 3.4 Bivariate Data 3.1 Visual Analysis of Data Visual

More information

IAT 355 Visual Analytics. Data and Statistical Models. Lyn Bartram

IAT 355 Visual Analytics. Data and Statistical Models. Lyn Bartram IAT 355 Visual Analytics Data and Statistical Models Lyn Bartram Exploring data Example: US Census People # of people in group Year # 1850 2000 (every decade) Age # 0 90+ Sex (Gender) # Male, female Marital

More information

Measures of Dispersion

Measures of Dispersion Measures of Dispersion 6-3 I Will... Find measures of dispersion of sets of data. Find standard deviation and analyze normal distribution. Day 1: Dispersion Vocabulary Measures of Variation (Dispersion

More information

CS570 Introduction to Data Mining

CS570 Introduction to Data Mining CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data Exploration and Data Preprocessing Data and attributes Data exploration Data pre-processing 2 10 What is Data?

More information

Chapter 6: DESCRIPTIVE STATISTICS

Chapter 6: DESCRIPTIVE STATISTICS Chapter 6: DESCRIPTIVE STATISTICS Random Sampling Numerical Summaries Stem-n-Leaf plots Histograms, and Box plots Time Sequence Plots Normal Probability Plots Sections 6-1 to 6-5, and 6-7 Random Sampling

More information

Table of Contents (As covered from textbook)

Table of Contents (As covered from textbook) Table of Contents (As covered from textbook) Ch 1 Data and Decisions Ch 2 Displaying and Describing Categorical Data Ch 3 Displaying and Describing Quantitative Data Ch 4 Correlation and Linear Regression

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu September 10, 2013 2: Data Pre-Processing Getting to know your data Basic Statistical Descriptions of Data

More information

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable. 5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Data Mining: Exploring Data. Lecture Notes for Chapter 3 Data Mining: Exploring Data Lecture Notes for Chapter 3 1 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include

More information

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before

More information

STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring Lecture 5 Tuesday, Feb 1 STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.

More information

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Data Exploration Chapter Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 What is data exploration?

More information

2.1: Frequency Distributions and Their Graphs

2.1: Frequency Distributions and Their Graphs 2.1: Frequency Distributions and Their Graphs Frequency Distribution - way to display data that has many entries - table that shows classes or intervals of data entries and the number of entries in each

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. 1 CHAPTER 1 Introduction Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. Variable: Any characteristic of a person or thing that can be expressed

More information

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation 10.4 Measures of Central Tendency and Variation Mode-->The number that occurs most frequently; there can be more than one mode ; if each number appears equally often, then there is no mode at all. (mode

More information

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation 10.4 Measures of Central Tendency and Variation Mode-->The number that occurs most frequently; there can be more than one mode ; if each number appears equally often, then there is no mode at all. (mode

More information

What are we working with? Data Abstractions. Week 4 Lecture A IAT 814 Lyn Bartram

What are we working with? Data Abstractions. Week 4 Lecture A IAT 814 Lyn Bartram What are we working with? Data Abstractions Week 4 Lecture A IAT 814 Lyn Bartram Munzner s What-Why-How What are we working with? DATA abstractions, statistical methods Why are we doing it? Task abstractions

More information

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures STA 2023 Module 3 Descriptive Measures Learning Objectives Upon completing this module, you should be able to: 1. Explain the purpose of a measure of center. 2. Obtain and interpret the mean, median, and

More information

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data Chapter 2 Descriptive Statistics: Organizing, Displaying and Summarizing Data Objectives Student should be able to Organize data Tabulate data into frequency/relative frequency tables Display data graphically

More information

STA Module 2B Organizing Data and Comparing Distributions (Part II)

STA Module 2B Organizing Data and Comparing Distributions (Part II) STA 2023 Module 2B Organizing Data and Comparing Distributions (Part II) Learning Objectives Upon completing this module, you should be able to 1 Explain the purpose of a measure of center 2 Obtain and

More information

STA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II)

STA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II) STA 2023 Module 2B Organizing Data and Comparing Distributions (Part II) Learning Objectives Upon completing this module, you should be able to 1 Explain the purpose of a measure of center 2 Obtain and

More information

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data. Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss

More information

Lecture Notes 3: Data summarization

Lecture Notes 3: Data summarization Lecture Notes 3: Data summarization Highlights: Average Median Quartiles 5-number summary (and relation to boxplots) Outliers Range & IQR Variance and standard deviation Determining shape using mean &

More information

Chapter 2 Describing, Exploring, and Comparing Data

Chapter 2 Describing, Exploring, and Comparing Data Slide 1 Chapter 2 Describing, Exploring, and Comparing Data Slide 2 2-1 Overview 2-2 Frequency Distributions 2-3 Visualizing Data 2-4 Measures of Center 2-5 Measures of Variation 2-6 Measures of Relative

More information

UNIT 1A EXPLORING UNIVARIATE DATA

UNIT 1A EXPLORING UNIVARIATE DATA A.P. STATISTICS E. Villarreal Lincoln HS Math Department UNIT 1A EXPLORING UNIVARIATE DATA LESSON 1: TYPES OF DATA Here is a list of important terms that we must understand as we begin our study of statistics

More information

Chapter 2 Data Preprocessing

Chapter 2 Data Preprocessing Chapter 2 Data Preprocessing CISC4631 1 Outline General data characteristics Data cleaning Data integration and transformation Data reduction Summary CISC4631 2 1 Types of Data Sets Record Relational records

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency Math 1 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency lowest value + highest value midrange The word average: is very ambiguous and can actually refer to the mean,

More information

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables

More information

CHAPTER 3: Data Description

CHAPTER 3: Data Description CHAPTER 3: Data Description You ve tabulated and made pretty pictures. Now what numbers do you use to summarize your data? Ch3: Data Description Santorico Page 68 You ll find a link on our website to a

More information

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining. About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts

More information

Chapter 3 - Displaying and Summarizing Quantitative Data

Chapter 3 - Displaying and Summarizing Quantitative Data Chapter 3 - Displaying and Summarizing Quantitative Data 3.1 Graphs for Quantitative Data (LABEL GRAPHS) August 25, 2014 Histogram (p. 44) - Graph that uses bars to represent different frequencies or relative

More information

AP Statistics Summer Assignment:

AP Statistics Summer Assignment: AP Statistics Summer Assignment: Read the following and use the information to help answer your summer assignment questions. You will be responsible for knowing all of the information contained in this

More information

Regression III: Advanced Methods

Regression III: Advanced Methods Lecture 3: Distributions Regression III: Advanced Methods William G. Jacoby Michigan State University Goals of the lecture Examine data in graphical form Graphs for looking at univariate distributions

More information

SLStats.notebook. January 12, Statistics:

SLStats.notebook. January 12, Statistics: Statistics: 1 2 3 Ways to display data: 4 generic arithmetic mean sample 14A: Opener, #3,4 (Vocabulary, histograms, frequency tables, stem and leaf) 14B.1: #3,5,8,9,11,12,14,15,16 (Mean, median, mode,

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Fall 2013 Reading: Chapter 3 Han, Chapter 2 Tan Anca Doloc-Mihu, Ph.D. Some slides courtesy of Li Xiong, Ph.D. and 2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SHRI ANGALAMMAN COLLEGE OF ENGINEERING & TECHNOLOGY (An ISO 9001:2008 Certified Institution) SIRUGANOOR,TRICHY-621105. DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year / Semester: IV/VII CS1011-DATA

More information

WELCOME! Lecture 3 Thommy Perlinger

WELCOME! Lecture 3 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important

More information

Create a bar graph that displays the data from the frequency table in Example 1. See the examples on p Does our graph look different?

Create a bar graph that displays the data from the frequency table in Example 1. See the examples on p Does our graph look different? A frequency table is a table with two columns, one for the categories and another for the number of times each category occurs. See Example 1 on p. 247. Create a bar graph that displays the data from the

More information

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order. Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good

More information

Understanding and Comparing Distributions. Chapter 4

Understanding and Comparing Distributions. Chapter 4 Understanding and Comparing Distributions Chapter 4 Objectives: Boxplot Calculate Outliers Comparing Distributions Timeplot The Big Picture We can answer much more interesting questions about variables

More information

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

What Is Data Mining? CMPT 354: Database I -- Data Mining 2 Data Mining What Is Data Mining? Mining data mining knowledge Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data CMPT

More information

Summarising Data. Mark Lunt 09/10/2018. Arthritis Research UK Epidemiology Unit University of Manchester

Summarising Data. Mark Lunt 09/10/2018. Arthritis Research UK Epidemiology Unit University of Manchester Summarising Data Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 09/10/2018 Summarising Data Today we will consider Different types of data Appropriate ways to summarise these

More information

Frequency Distributions

Frequency Distributions Displaying Data Frequency Distributions After collecting data, the first task for a researcher is to organize and summarize the data so that it is possible to get a general overview of the results. Remember,

More information

1 Overview of Statistics; Essential Vocabulary

1 Overview of Statistics; Essential Vocabulary 1 Overview of Statistics; Essential Vocabulary Statistics: the science of collecting, organizing, analyzing, and interpreting data in order to make decisions Population and sample Population: the entire

More information

Data Mining: Exploring Data

Data Mining: Exploring Data Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar But we start with a brief discussion of the Friedman article and the relationship between Data

More information

UNIT 1: NUMBER LINES, INTERVALS, AND SETS

UNIT 1: NUMBER LINES, INTERVALS, AND SETS ALGEBRA II CURRICULUM OUTLINE 2011-2012 OVERVIEW: 1. Numbers, Lines, Intervals and Sets 2. Algebraic Manipulation: Rational Expressions and Exponents 3. Radicals and Radical Equations 4. Function Basics

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

Statistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or me, I will answer promptly.

Statistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or  me, I will answer promptly. Statistical Methods Instructor: Lingsong Zhang 1 Issues before Class Statistical Methods Lingsong Zhang Office: Math 544 Email: lingsong@purdue.edu Phone: 765-494-7913 Office Hour: Monday 1:00 pm - 2:00

More information

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures Part I, Chapters 4 & 5 Data Tables and Data Analysis Statistics and Figures Descriptive Statistics 1 Are data points clumped? (order variable / exp. variable) Concentrated around one value? Concentrated

More information

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4 Principles of Knowledge Discovery in Data Fall 2004 Chapter 3: Data Preprocessing Dr. Osmar R. Zaïane University of Alberta Summary of Last Chapter What is a data warehouse and what is it for? What is

More information

Data Mining: Concepts and Techniques. Chapter 2

Data Mining: Concepts and Techniques. Chapter 2 Data Mining: Concepts and Techniques Chapter 2 Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj 2006 Jiawei Han and Micheline Kamber, All rights

More information

1. Basic Steps for Data Analysis Data Editor. 2.4.To create a new SPSS file

1. Basic Steps for Data Analysis Data Editor. 2.4.To create a new SPSS file 1 SPSS Guide 2009 Content 1. Basic Steps for Data Analysis. 3 2. Data Editor. 2.4.To create a new SPSS file 3 4 3. Data Analysis/ Frequencies. 5 4. Recoding the variable into classes.. 5 5. Data Analysis/

More information

Chapter 1. Looking at Data-Distribution

Chapter 1. Looking at Data-Distribution Chapter 1. Looking at Data-Distribution Statistics is the scientific discipline that provides methods to draw right conclusions: 1)Collecting the data 2)Describing the data 3)Drawing the conclusions Raw

More information

Quantitative - One Population

Quantitative - One Population Quantitative - One Population The Quantitative One Population VISA procedures allow the user to perform descriptive and inferential procedures for problems involving one population with quantitative (interval)

More information

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA This lab will assist you in learning how to summarize and display categorical and quantitative data in StatCrunch. In particular, you will learn how to

More information

Exploratory Data Analysis

Exploratory Data Analysis Chapter 10 Exploratory Data Analysis Definition of Exploratory Data Analysis (page 410) Definition 12.1. Exploratory data analysis (EDA) is a subfield of applied statistics that is concerned with the investigation

More information

Chapter2 Description of samples and populations. 2.1 Introduction.

Chapter2 Description of samples and populations. 2.1 Introduction. Chapter2 Description of samples and populations. 2.1 Introduction. Statistics=science of analyzing data. Information collected (data) is gathered in terms of variables (characteristics of a subject that

More information

Univariate Statistics Summary

Univariate Statistics Summary Further Maths Univariate Statistics Summary Types of Data Data can be classified as categorical or numerical. Categorical data are observations or records that are arranged according to category. For example:

More information

Measures of Central Tendency

Measures of Central Tendency Page of 6 Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean The sum of all data values divided by the number of

More information

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies.

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies. Instructions: You are given the following data below these instructions. Your client (Courtney) wants you to statistically analyze the data to help her reach conclusions about how well she is teaching.

More information

CHAPTER 2 DESCRIPTIVE STATISTICS

CHAPTER 2 DESCRIPTIVE STATISTICS CHAPTER 2 DESCRIPTIVE STATISTICS 1. Stem-and-Leaf Graphs, Line Graphs, and Bar Graphs The distribution of data is how the data is spread or distributed over the range of the data values. This is one of

More information

刘淇 School of Computer Science and Technology USTC

刘淇 School of Computer Science and Technology USTC Data Exploration 刘淇 School of Computer Science and Technology USTC http://staff.ustc.edu.cn/~qiliuql/dm2013.html t t / l/dm2013 l What is data exploration? A preliminary exploration of the data to better

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

Classification with Decision Tree Induction

Classification with Decision Tree Induction Classification with Decision Tree Induction This algorithm makes Classification Decision for a test sample with the help of tree like structure (Similar to Binary Tree OR k-ary tree) Nodes in the tree

More information

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 January 25, 2007 CSE-4412: Data Mining 1 Chapter 6 Classification and Prediction 1. What is classification? What is prediction?

More information

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques SEVENTH EDITION and EXPANDED SEVENTH EDITION Slide - Chapter Statistics. Sampling Techniques Statistics Statistics is the art and science of gathering, analyzing, and making inferences from numerical information

More information

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set.

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set. Measures of Central Tendency A measure of central tendency is a value used to represent the typical or average value in a data set. The Mean the sum of all data values divided by the number of values in

More information

Enduring Understandings: Some basic math skills are required to be reviewed in preparation for the course.

Enduring Understandings: Some basic math skills are required to be reviewed in preparation for the course. Curriculum Map for Functions, Statistics and Trigonometry September 5 Days Targeted NJ Core Curriculum Content Standards: N-Q.1, N-Q.2, N-Q.3, A-CED.1, A-REI.1, A-REI.3 Enduring Understandings: Some basic

More information

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI

More information

Section 9: One Variable Statistics

Section 9: One Variable Statistics The following Mathematics Florida Standards will be covered in this section: MAFS.912.S-ID.1.1 MAFS.912.S-ID.1.2 MAFS.912.S-ID.1.3 Represent data with plots on the real number line (dot plots, histograms,

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

Mineração de Dados Aplicada

Mineração de Dados Aplicada Data Exploration August, 9 th 2017 DCC ICEx UFMG Summary of the last session Data mining Data mining is an empiricism; It can be seen as a generalization of querying; It lacks a unified theory; It implies

More information

UNIT 2 Data Preprocessing

UNIT 2 Data Preprocessing UNIT 2 Data Preprocessing Lecture Topic ********************************************** Lecture 13 Why preprocess the data? Lecture 14 Lecture 15 Lecture 16 Lecture 17 Data cleaning Data integration and

More information

Data Science. Data Analyst. Data Scientist. Data Architect

Data Science. Data Analyst. Data Scientist. Data Architect Data Science Data Analyst Data Analysis in Excel Programming in R Introduction to Python/SQL/Tableau Data Visualization in R / Tableau Exploratory Data Analysis Data Scientist Inferential Statistics &

More information

Data Collection, Preprocessing and Implementation

Data Collection, Preprocessing and Implementation Chapter 6 Data Collection, Preprocessing and Implementation 6.1 Introduction Data collection is the loosely controlled method of gathering the data. Such data are mostly out of range, impossible data combinations,

More information

Exploring and Understanding Data Using R.

Exploring and Understanding Data Using R. Exploring and Understanding Data Using R. Loading the data into an R data frame: variable

More information

DATA WAREHOUING UNIT I

DATA WAREHOUING UNIT I BHARATHIDASAN ENGINEERING COLLEGE NATTRAMAPALLI DEPARTMENT OF COMPUTER SCIENCE SUB CODE & NAME: IT6702/DWDM DEPT: IT Staff Name : N.RAMESH DATA WAREHOUING UNIT I 1. Define data warehouse? NOV/DEC 2009

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

Mathematics. Algebra, Functions, and Data Analysis Curriculum Guide. Revised 2010

Mathematics. Algebra, Functions, and Data Analysis Curriculum Guide. Revised 2010 Mathematics Algebra, Functions, and Data Analysis Curriculum Guide Revised 010 This page is intentionally left blank. Introduction The Mathematics Curriculum Guide serves as a guide for teachers when planning

More information

3 Graphical Displays of Data

3 Graphical Displays of Data 3 Graphical Displays of Data Reading: SW Chapter 2, Sections 1-6 Summarizing and Displaying Qualitative Data The data below are from a study of thyroid cancer, using NMTR data. The investigators looked

More information

Probability and Statistics. Copyright Cengage Learning. All rights reserved.

Probability and Statistics. Copyright Cengage Learning. All rights reserved. Probability and Statistics Copyright Cengage Learning. All rights reserved. 14.5 Descriptive Statistics (Numerical) Copyright Cengage Learning. All rights reserved. Objectives Measures of Central Tendency:

More information