Course on Data Mining ( )

Size: px
Start display at page:

Download "Course on Data Mining ( )"

Transcription

1 Course on Data Mining ( ) Intro/Ass. Rules 24./ Episodes Home Exam Clustering KDD Process Text Mining Appl./Summary Data mining: KDD Process 1 Course on Data Mining ( ) Today Today's subject: o KDD Process Next week's program: o Lecture: Data mining applications, future, summary o Exercise: KDD Process o Seminar: KDD Process Data mining: KDD Process 2 1

2 KDD process Overview Overview Preprocessing Post-processing Summary Data mining: KDD Process 3 What is KDD? A process! Aim: the selection and processing of data for o the identification of novel, accurate, and useful patterns, and o the modeling of real-world phenomena Data mining is a major component of the KDD process Data mining: KDD Process 4 2

3 Typical KDD process Target data set Selection Operational Database Raw data Input data Preprocessing Cleaned Verified Focused Data mining Postprocessing Selection Results Utilization Selected usable patterns Data mining: KDD Process 5 Phases of the KDD process (1) Learning the domain Eval. of interestingness Preprocessing Creating a target data set Data cleaning, integration and transformation Data reduction and projection Choosing the DM task Data mining: KDD Process 6 3

4 Phases of the KDD process (2) Choosing the DM algorithm(s) Data mining: search Postprocessing Pattern evaluation and interpretation Knowledge presentation Use of discovered knowledge Data mining: KDD Process 7 Preprocessing - overview Preprocessing Why data preprocessing? Data cleaning Data integration and transformation Data reduction Data mining: KDD Process 8 4

5 Why data preprocessing? Aim: to select the data relevant with respect to the task in hand to be mined Data in the real world is dirty o incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data o noisy: containing errors or outliers o inconsistent: containing discrepancies in codes or names No quality data, no quality mining results! Data mining: KDD Process 9 Measures of data quality o accuracy o completeness o consistency o timeliness o believability o value added o interpretability o accessibility Data mining: KDD Process 10 5

6 Preprocessing tasks (1) Data cleaning o fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration o integration of multiple databases, files, etc. Data transformation o normalization and aggregation Data mining: KDD Process 11 Preprocessing tasks (2) Data reduction (including discretization) o obtains reduced representation in volume, but produces the same or similar analytical results o data discretization is part of data reduction, but with particular importance, especially for numerical data Data mining: KDD Process 12 6

7 Preprocessing tasks (3) Data Cleaning Data Integration Data Transformation Data Reduction Data mining: KDD Process 13 Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data Data mining: KDD Process 14 7

8 Missing Data Data is not always available Missing data may be due to o equipment malfunction o inconsistent with other recorded data, and thus deleted o data not entered due to misunderstanding o certain data may not be considered important at the time of entry o not register history or changes of the data Missing data may need to be inferred Data mining: KDD Process 15 How to Handle Missing Data? (1) Ignore the tuple o usually done when the class label is missing o not effective, when the percentage of missing values per attribute varies considerably Fill in the missing value manually o tedious + infeasible? Use a global constant to fill in the missing value o e.g., unknown, a new class?! Data mining: KDD Process 16 8

9 How to Handle Missing Data? (2) Use the attribute mean to fill in the missing value Use the attribute mean for all samples belonging to the same class to fill in the missing value o smarter solution than using the general attribute mean Use the most probable value to fill in the missing value o inference-based tools such as decision tree induction or a Bayesian formalism o regression Data mining: KDD Process 17 Noisy Data Noise: random error or variance in a measured variable Incorrect attribute values may due to o faulty data collection instruments o data entry problems o data transmission problems o technology limitation o inconsistency in naming convention Data mining: KDD Process 18 9

10 How to Handle Noisy Data? Binning o smooth a sorted data value by looking at the values around it Clustering o detect and remove outliers Combined computer and human inspection o detect suspicious values and check by human Regression o smooth by fitting the data into regression functions Data mining: KDD Process 19 Binning methods (1) Equal-depth (frequency) partitioning o sort data and partition into bins, N intervals, each containing approximately same number of samples o smooth by bin means, bin median, bin boundaries, etc. o good data scaling o managing categorical attributes can be tricky Data mining: KDD Process 20 10

11 Binning methods (2) Equal-width (distance) partitioning o divide the range into N intervals of equal size: uniform grid o if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N. o the most straightforward o outliers may dominate presentation o skewed data is not handled well Data mining: KDD Process 21 Equal-depth binning - Example Sorted data for price (in dollars): o 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 Partition into (equal-depth) bins: o Bin 1: 4, 8, 9, 15 o Bin 2: 21, 21, 24, 25 o Bin 3: 26, 28, 29, 34 Smoothing by bin means: o Bin 1: 9, 9, 9, 9 o Bin 2: 23, 23, 23, 23 o Bin 3: 29, 29, 29, 29 by bin boundaries: o Bin 1: 4, 4, 4, 15 o Bin 2: 21, 21, 25, 25 o Bin 3: 26, 26, 26, Data mining: KDD Process 22 11

12 Data Integration (1) Data integration o combines data from multiple sources into a coherent store Schema integration o integrate metadata from different sources o entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id B.cust-# Data mining: KDD Process 23 Data Integration (2) Detecting and resolving data value conflicts o for the same real world entity, attribute values from different sources are different o possible reasons: different representations, different scales, e.g., metric vs. British units Data mining: KDD Process 24 12

13 Handling Redundant Data Redundant data occur often, when multiple databases are integrated o the same attribute may have different names in different databases o one attribute may be a derived attribute in another table, e.g., annual revenue Redundant data may be detected by correlation analysis Careful integration of data from multiple sources may o help to reduce/avoid redundancies and inconsistencies o improve mining speed and quality Data mining: KDD Process 25 Data Transformation Smoothing: remove noise from data Aggregation: summarization, data cube construction Generalization: concept hierarchy climbing Normalization: scaled to fall within a small, specified range, e.g., o min-max normalization o normalization by decimal scaling Attribute/feature construction o new attributes constructed from the given ones Data mining: KDD Process 26 13

14 Data Reduction Data reduction o obtains a reduced representation of the data set that is much smaller in volume o produces the same (or almost the same) analytical results as the original data Data reduction strategies o dimensionality reduction o numerosity reduction o discretization and concept hierarchy generation Data mining: KDD Process 27 Dimensionality Reduction Feature selection (i.e., attribute subset selection): o select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features o reduce the number of patterns in the patterns, easier to understand Heuristic methods (due to exponential # of choices): o step-wise forward selection o step-wise backward elimination o combining forward selection and backward elimination Data mining: KDD Process 28 14

15 Dimensionality Reduction - Example Initial attribute set: {A1, A2, A3, A4, A5, A6} A4? A1? A6? Class 1 Class 2 Class 1 Class 2 > Reduced attribute set: {A1, A4, A6} Data mining: KDD Process 29 Numerosity Reduction Parametric methods o assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers) o e.g., regression analysis, log-linear models Non-parametric methods o do not assume models o e.g., histograms, clustering, sampling Data mining: KDD Process 30 15

16 Discretization Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Some classification algorithms only accept categorical attributes Data mining: KDD Process 31 Concept Hierarchies Reduce the data by collecting and replacing low level concepts by higher level concepts For example, replace numeric values for the attribute age by more general values young, middle-aged, or senior Data mining: KDD Process 32 16

17 Discretization and concept hierarchy generation for numeric data Binning Histogram analysis Clustering analysis Entropy-based discretization Segmentation by natural partitioning Data mining: KDD Process 33 Concept hierarchy generation for categorical data Specification of a partial ordering of attributes explicitly at the schema level by users or experts Specification of a portion of a hierarchy by explicit data grouping Specification of a set of attributes, but not of their partial ordering Specification of only a partial set of attributes Data mining: KDD Process 34 17

18 Specification of a set of attributes Concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy. country province_or_ state city street 15 distinct values 65 distinct values 3567 distinct values distinct values Data mining: KDD Process 35 Post-processing - overview Post-processing Why data postprocessing? Interestingness Visualization Utilization Data mining: KDD Process 36 18

19 Why data post-processing? (1) Aim: to show the results, or more precisely the most interesting findings, of the data mining phase to a user/users in an understandable way A possible post-processing methodology: o find all potentially interesting patterns according to some rather loose criteria o provide flexible methods for iteratively and interactively creating different views of the discovered patterns Other more restrictive or focused methodologies possible as well Data mining: KDD Process 37 Why data post-processing? (2) A post-processing methodology is useful, if o the desired focus is not known in advance (the search process cannot be optimized to look only for the interesting patterns) o there is an algorithm that can produce all patterns from a class of potentially interesting patterns (the result is complete) o the time requirement for discovering all potentially interesting patterns is not considerably longer than, if the discovery was focused to a small subset of potentially interesting patterns Data mining: KDD Process 38 19

20 Are all the discovered pattern interesting? A data mining system/query may generate thousands of patterns, but are they all interesting? Usually NOT! How could we then choose the interesting patterns? => Interestingness Data mining: KDD Process 39 Interestingness criteria (1) Some possible criteria for interestingness: o evidence: statistical significance of finding? o redundancy: similarity between findings? o usefulness: meeting the user's needs/goals? o novelty: already prior knowledge? o simplicity: syntactical complexity? o generality: how many examples covered? Data mining: KDD Process 40 20

21 Interestingness criteria(2) One division of interestingness criteria: o objective measures that are based on statistics and structures of patterns, e.g., J-measure: statistical significance certainty factor: support or frequency strength: confidence o subjective measures that are based on user s beliefs in the data, e.g., unexpectedness: is the found pattern surprising?" actionability: can I do something with it?" Data mining: KDD Process 41 Criticism: Support & Confidence Example: (Aggarwal & Yu, PODS98) o among 5000 students 3000 play basketball, 3750 eat cereal 2000 both play basket ball and eat cereal o the rule play basketball eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7% o the rule play basketball not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence Data mining: KDD Process 42 21

22 Interest Yet another objective measure for interestingness is interest that is defined as P( A B) P( A) P( B) Properties of this measure: o takes both P(A) and P(B) in consideration: o P(A^B)=P(B)*P(A), if A and B are independent events o A and B negatively correlated, if the value is less than 1; otherwise A and B positively correlated Data mining: KDD Process 43 Also J-measure J-measure J measure (1 conf ( A = conf ( A) ( conf ( A B) ) 1 conf ( A B) B) log( 1 conf ( B) conf ( A B) log + conf ( B) )) is an objective measure for interestingness Properties of J-measure: o again, takes both P(A) and P(B) in consideration o value is always between 0 and 1 o can be computed using pre-calculated values Data mining: KDD Process 44 22

23 Support/Frequency/J-measure Rules ,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 Threshold Dataset 1 Dataset 2 Dataset 3 Dataset 4 Dataset 5 Dataset Data mining: KDD Process 45 Confidence Rules ,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 Confidence threshold Dataset 1 Dataset 2 Dataset 3 Dataset 4 Dataset 5 Dataset Data mining: KDD Process 46 23

24 Example Selection of Interesting Association Rules For reducing the number of association rules that have to be considered, we could, for example, use one of the following selection criteria: o frequency and confidence o J-measure or interest o maximum rule size (whole rule, left-hand side, right-hand side) o rule attributes (e.g., templates) Data mining: KDD Process 47 Example Problems with selection of rules A rule can correspond to prior knowledge or expectations o how to encode the background knowledge into the system? A rule can refer to uninteresting attributes or attribute combinations o could this be avoided by enhancing the preprocessing phase? Rules can be redundant o redundancy elimination by rule covers etc Data mining: KDD Process 48 24

25 Interpretation and evaluation of the results of data mining Evaluation o statistical validation and significance testing o qualitative review by experts in the field o pilot surveys to evaluate model accuracy Interpretation o tree and rule models can be read directly o clustering results can be graphed and tabled o code can be automatically generated by some systems Data mining: KDD Process 49 Visualization of Discovered Patterns (1) In some cases, visualization of the results of data mining (rules, clusters, networks ) can be very helpful Visualization is actually already important in the preprocessing phase in selecting the appropriate data or in looking at the data Visualization requires training and practice Data mining: KDD Process 50 25

26 Visualization of Discovered Patterns (2) Different backgrounds/usages may require different forms of representation o e.g., rules, tables, cross-tabulations, or pie/bar chart Concept hierarchy is also important o discovered knowledge might be more understandable when represented at high level of abstraction o interactive drill up/down, pivoting, slicing and dicing provide different perspective to data Different kinds of knowledge require different kinds of representation o association, classification, clustering, etc Data mining: KDD Process 51 Visualization Data mining: KDD Process 52 26

27 Data mining: KDD Process 53 Utilization of the results Increasing potential to support business decisions Making Decisions Data Presentation Visualization Techniques Data Mining Information Discovery End User Business Analyst Data Analyst Data Exploration Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts OLAP, MDA Data Sources Paper, Files, Information Providers, Database Systems, OLTP DBA Data mining: KDD Process 54 27

28 Summary Data mining: semi-automatic discovery of interesting patterns from large data sets Knowledge discovery is a process: o preprocessing o data mining o post-processing o using and utilizing the knowledge Data mining: KDD Process 55 Summary Preprocessing is important in order to get useful results! If a loosely defined mining methodology is used, postprocessing is needed in order to find the interesting results! Visualization is useful in preand post-processing! One has to be able to utilize the found knowledge! Data mining: KDD Process 56 28

29 References KDD Process P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, R.J. Brachman, T. Anand. The process of knowledge discovery in databases. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of ACM, 42:73-78, M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8: , U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, Jagadish et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December D. Keim, Visual techniques for exploring databases. Tutorial notes in KDD 97, Newport Beach, CA, USA, D. Keim, Visual data mining. Tutorial notes in VLDB 97, Athens, Greece, D. Keim, and H.P. Krieger, Visual techniques for mining large databases: a comparison. IEEE Transactions on Knowledge and Data Engineering, 8(6), Data mining: KDD Process 57 References KDD Process M. Klemettinen, A knowledge discovery methodology for telecommunication network alarm databases. Ph.D. thesis, University of Helsinki, Report A , M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM 94, Gaithersburg, Maryland, Nov G. Piatetsky-Shapiro, U. Fayyad, and P. Smith. From data mining to knowledge discovery: An overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, T. Redman. Data Quality: Management and Technology. Bantam Books, New York, A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8: , Dec D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June Data mining: KDD Process 58 29

30 References KDD Process Y. Wand and R. Wang. Anchoring data quality dimensions ontological foundations. Communications of ACM, 39:86-95, R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7: , Data mining: KDD Process 59 Reminder: Course Organization Course Evaluation Passing the course: min 30 points o home exam: min 13 points (max 30 points) o exercises/experiments: min 8 points (max 20 points) at least 3 returned and reported experiments o group presentation: min 4 points (max 10 points) Remember also the other requirements: o attending the lectures (5/7) o attending the seminars (4/5) o attending the exercises (4/5) Data mining: KDD Process 60 30

31 Seminar Presentations/Groups 9-10 Visualization and data mining D. Keim, H.P., Kriegel, T. Seidl: Supporting Data Mining of Large Databases by Visual Feedback Queries", ICDE Data mining: KDD Process 61 Seminar Presentations/Groups 9-10 Interestingness G. Piatetsky-Shapiro, C.J. Matheus: The Interestingness of Deviations, KDD Data mining: KDD Process 62 31

32 KDD process Thanks to Jiawei Han from Simon Fraser University and Mika Klemettinen from Nokia Research Center for their slides which greatly helped in preparing this lecture! Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides Data mining: KDD Process 63 32

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept

More information

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing

More information

UNIT 2 Data Preprocessing

UNIT 2 Data Preprocessing UNIT 2 Data Preprocessing Lecture Topic ********************************************** Lecture 13 Why preprocess the data? Lecture 14 Lecture 15 Lecture 16 Lecture 17 Data cleaning Data integration and

More information

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2011 Han, Kamber & Pei. All rights

More information

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 1 Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks in Data Preprocessing Data Cleaning Data Integration Data

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

UNIT 2. DATA PREPROCESSING AND ASSOCIATION RULES

UNIT 2. DATA PREPROCESSING AND ASSOCIATION RULES UNIT 2. DATA PREPROCESSING AND ASSOCIATION RULES Data Pre-processing-Data Cleaning, Integration, Transformation, Reduction, Discretization Concept Hierarchies-Concept Description: Data Generalization And

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu September 10, 2013 2: Data Pre-Processing Getting to know your data Basic Statistical Descriptions of Data

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4 Principles of Knowledge Discovery in Data Fall 2004 Chapter 3: Data Preprocessing Dr. Osmar R. Zaïane University of Alberta Summary of Last Chapter What is a data warehouse and what is it for? What is

More information

Data Preprocessing. Data Mining 1

Data Preprocessing. Data Mining 1 Data Preprocessing Today s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size and their likely origin from multiple, heterogenous sources.

More information

By Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad

By Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad By Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad Data Analytics life cycle Discovery Data preparation Preprocessing requirements data cleaning, data integration, data reduction, data

More information

ECT7110. Data Preprocessing. Prof. Wai Lam. ECT7110 Data Preprocessing 1

ECT7110. Data Preprocessing. Prof. Wai Lam. ECT7110 Data Preprocessing 1 ECT7110 Data Preprocessing Prof. Wai Lam ECT7110 Data Preprocessing 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest,

More information

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

CS 521 Data Mining Techniques Instructor: Abdullah Mueen CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 2: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 02 Data Processing, Data Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

Data Preprocessing in Python. Prof.Sushila Aghav

Data Preprocessing in Python. Prof.Sushila Aghav Data Preprocessing in Python Prof.Sushila Aghav Sushila.aghav@mitcoe.edu.in Content Why preprocess the data? Descriptive data summarization Data cleaning Data integration and transformation April 24, 2018

More information

Jarek Szlichta

Jarek Szlichta Jarek Szlichta http://data.science.uoit.ca/ Open data Business Data Web Data Available at different formats 2 Data Scientist: The Sexiest Job of the 21 st Century Harvard Business Review Oct. 2012 (c)

More information

Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Mining: Concepts and Techniques Slides for Textbook Chapter 4 October 17, 2006 Data Mining: Concepts and Techniques 1 Chapter 4: Data Mining Primitives, Languages, and System Architectures Data mining

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Fall 2013 Reading: Chapter 3 Han, Chapter 2 Tan Anca Doloc-Mihu, Ph.D. Some slides courtesy of Li Xiong, Ph.D. and 2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.

More information

Data Preprocessing. Outline. Motivation. How did this happen?

Data Preprocessing. Outline. Motivation. How did this happen? Outline Data Preprocessing Motivation Data cleaning Data integration and transformation Data reduction Discretization and hierarchy generation Summary CS 5331 by Rattikorn Hewett Texas Tech University

More information

Data Mining: Concepts and Techniques. Chapter 2

Data Mining: Concepts and Techniques. Chapter 2 Data Mining: Concepts and Techniques Chapter 2 Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj 2006 Jiawei Han and Micheline Kamber, All rights

More information

K236: Basis of Data Science

K236: Basis of Data Science Schedule of K236 K236: Basis of Data Science Lecture 6: Data Preprocessing Lecturer: Tu Bao Ho and Hieu Chi Dam TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai 1. Introduction to data science

More information

Data Preprocessing. Komate AMPHAWAN

Data Preprocessing. Komate AMPHAWAN Data Preprocessing Komate AMPHAWAN 1 Data cleaning (data cleansing) Attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. 2 Missing value

More information

cse634 Data Mining Preprocessing Lecture Notes Chapter 2 Professor Anita Wasilewska

cse634 Data Mining Preprocessing Lecture Notes Chapter 2 Professor Anita Wasilewska cse634 Data Mining Preprocessing Lecture Notes Chapter 2 Professor Anita Wasilewska Chapter 2: Data Preprocessing (book slide) Why preprocess the data? Descriptive data summarization Data cleaning Data

More information

Data Preprocessing. Erwin M. Bakker & Stefan Manegold. https://homepages.cwi.nl/~manegold/dbdm/

Data Preprocessing. Erwin M. Bakker & Stefan Manegold. https://homepages.cwi.nl/~manegold/dbdm/ Data Preprocessing Erwin M. Bakker & Stefan Manegold https://homepages.cwi.nl/~manegold/dbdm/ http://liacs.leidenuniv.nl/~bakkerem2/dbdm/ s.manegold@liacs.leidenuniv.nl e.m.bakker@liacs.leidenuniv.nl 9/26/17

More information

Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Mining: Concepts and Techniques Chapter 2 Original Slides: Jiawei Han and Micheline Kamber Modification: Li Xiong Data Mining: Concepts and Techniques 1 Chapter 2: Data Preprocessing Why preprocess

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013 Han, Kamber & Pei. All rights

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Data preprocessing Functional Programming and Intelligent Algorithms

Data preprocessing Functional Programming and Intelligent Algorithms Data preprocessing Functional Programming and Intelligent Algorithms Que Tran Høgskolen i Ålesund 20th March 2017 1 Why data preprocessing? Real-world data tend to be dirty incomplete: lacking attribute

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013 Han, Kamber & Pei. All rights

More information

Data Collection, Preprocessing and Implementation

Data Collection, Preprocessing and Implementation Chapter 6 Data Collection, Preprocessing and Implementation 6.1 Introduction Data collection is the loosely controlled method of gathering the data. Such data are mostly out of range, impossible data combinations,

More information

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining. Martin Ester Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro

More information

A Survey on Data Preprocessing Techniques for Bioinformatics and Web Usage Mining

A Survey on Data Preprocessing Techniques for Bioinformatics and Web Usage Mining Volume 117 No. 20 2017, 785-794 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu A Survey on Data Preprocessing Techniques for Bioinformatics and Web

More information

Data Mining and Analytics. Introduction

Data Mining and Analytics. Introduction Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data

More information

Data Mining: Concepts and Techniques. Chapter 2

Data Mining: Concepts and Techniques. Chapter 2 Data Mining: Concepts and Techniques Chapter 2 Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj 2006 Jiawei Han and Micheline Kamber, All rights

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 03 : 13/10/2015 Data Mining: Concepts and Techniques (3 rd ed.) Chapter

More information

Chapter 2 Data Preprocessing

Chapter 2 Data Preprocessing Chapter 2 Data Preprocessing CISC4631 1 Outline General data characteristics Data cleaning Data integration and transformation Data reduction Summary CISC4631 2 1 Types of Data Sets Record Relational records

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV Subject Name: Elective I Data Warehousing & Data Mining (DWDM) Subject Code: 2640005 Learning Objectives: To understand

More information

Road Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary

Road Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary 2. Data preprocessing Road Map Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary 2 Data types Categorical vs. Numerical Scale types

More information

DATA WAREHOUING UNIT I

DATA WAREHOUING UNIT I BHARATHIDASAN ENGINEERING COLLEGE NATTRAMAPALLI DEPARTMENT OF COMPUTER SCIENCE SUB CODE & NAME: IT6702/DWDM DEPT: IT Staff Name : N.RAMESH DATA WAREHOUING UNIT I 1. Define data warehouse? NOV/DEC 2009

More information

Chapter 1, Introduction

Chapter 1, Introduction CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from

More information

DATA PREPROCESSING. Tzompanaki Katerina

DATA PREPROCESSING. Tzompanaki Katerina DATA PREPROCESSING Tzompanaki Katerina Background: Data storage formats Data in DBMS ODBC, JDBC protocols Data in flat files Fixed-width format (each column has a specific number of characters, filled

More information

Data Mining Course Overview

Data Mining Course Overview Data Mining Course Overview 1 Data Mining Overview Understanding Data Classification: Decision Trees and Bayesian classifiers, ANN, SVM Association Rules Mining: APriori, FP-growth Clustering: Hierarchical

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Abstract Mrs. C. Poongodi 1, Ms. R. Kalaivani 2 1 PG Student, 2 Assistant Professor, Department of

More information

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining. About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts

More information

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

An Overview of various methodologies used in Data set Preparation for Data mining Analysis An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of

More information

Interestingness Measurements

Interestingness Measurements Interestingness Measurements Objective measures Two popular measurements: support and confidence Subjective measures [Silberschatz & Tuzhilin, KDD95] A rule (pattern) is interesting if it is unexpected

More information

Data Preparation. Data Preparation. (Data pre-processing) Why Prepare Data? Why Prepare Data? Some data preparation is needed for all mining tools

Data Preparation. Data Preparation. (Data pre-processing) Why Prepare Data? Why Prepare Data? Some data preparation is needed for all mining tools Data Preparation Data Preparation (Data pre-processing) Why prepare the data? Discretization Data cleaning Data integration and transformation Data reduction, Feature selection 2 Why Prepare Data? Why

More information

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University Data Mining Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University Why Mine Data? Commercial Viewpoint Lots of data is being collected and warehoused Web data, e-commerce

More information

Data Exploration and Preparation Data Mining and Text Mining (UIC Politecnico di Milano)

Data Exploration and Preparation Data Mining and Text Mining (UIC Politecnico di Milano) Data Exploration and Preparation Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining, : Concepts and Techniques", The Morgan Kaufmann

More information

Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Mining: Concepts and Techniques Slides for Textbook Chapter 1 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser University, Canada

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before

More information

New Orleans, Louisiana, February/March Knowledge Discovery from Telecommunication. Network Alarm Databases. K. Hatonen M. Klemettinen H.

New Orleans, Louisiana, February/March Knowledge Discovery from Telecommunication. Network Alarm Databases. K. Hatonen M. Klemettinen H. To appear in the 12th International Conference on Data Engineering (ICDE'96), New Orleans, Louisiana, February/March 1996. Knowledge Discovery from Telecommunication Network Alarm Databases K. Hatonen

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.

More information

Interestingness Measurements

Interestingness Measurements Interestingness Measurements Objective measures Two popular measurements: support and confidence Subjective measures [Silberschatz & Tuzhilin, KDD95] A rule (pattern) is interesting if it is unexpected

More information

Association Rule Selection in a Data Mining Environment

Association Rule Selection in a Data Mining Environment Association Rule Selection in a Data Mining Environment Mika Klemettinen 1, Heikki Mannila 2, and A. Inkeri Verkamo 1 1 University of Helsinki, Department of Computer Science P.O. Box 26, FIN 00014 University

More information

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Data Exploration Chapter Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 What is data exploration?

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Data Mining: Exploring Data. Lecture Notes for Chapter 3 Data Mining: Exploring Data Lecture Notes for Chapter 3 1 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include

More information

SCHEME OF COURSE WORK. Data Warehousing and Data mining

SCHEME OF COURSE WORK. Data Warehousing and Data mining SCHEME OF COURSE WORK Course Details: Course Title Course Code Program: Specialization: Semester Prerequisites Department of Information Technology Data Warehousing and Data mining : 15CT1132 : B.TECH

More information

Handling Missing Values via Decomposition of the Conditioned Set

Handling Missing Values via Decomposition of the Conditioned Set Handling Missing Values via Decomposition of the Conditioned Set Mei-Ling Shyu, Indika Priyantha Kuruppu-Appuhamilage Department of Electrical and Computer Engineering, University of Miami Coral Gables,

More information

Discovering interesting rules from financial data

Discovering interesting rules from financial data Discovering interesting rules from financial data Przemysław Sołdacki Institute of Computer Science Warsaw University of Technology Ul. Andersa 13, 00-159 Warszawa Tel: +48 609129896 email: psoldack@ii.pw.edu.pl

More information

What is Data Mining?

What is Data Mining? Introduction What is Data Mining? Data Mining: Concepts and Techniques Slides for Course Data Mining Chapter 1 Jiawei Han 1 Necessity Is the Mother of Invention Data explosion problem Automated data collection

More information

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set To Enhance Scalability of Item Transactions by Parallel and Partition using Dynamic Data Set Priyanka Soni, Research Scholar (CSE), MTRI, Bhopal, priyanka.soni379@gmail.com Dhirendra Kumar Jha, MTRI, Bhopal,

More information

CS570 Introduction to Data Mining

CS570 Introduction to Data Mining CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data Exploration and Data Preprocessing Data and attributes Data exploration Data pre-processing 2 10 What is Data?

More information

CT75 (ALCCS) DATA WAREHOUSING AND DATA MINING JUN

CT75 (ALCCS) DATA WAREHOUSING AND DATA MINING JUN Q.1 a. Define a Data warehouse. Compare OLTP and OLAP systems. Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and 2 Non volatile collection of data in support of management

More information

Question Bank. 4) It is the source of information later delivered to data marts.

Question Bank. 4) It is the source of information later delivered to data marts. Question Bank Year: 2016-2017 Subject Dept: CS Semester: First Subject Name: Data Mining. Q1) What is data warehouse? ANS. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile

More information

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad - 500 043 INFORMATION TECHNOLOGY DEFINITIONS AND TERMINOLOGY Course Name : DATA WAREHOUSING AND DATA MINING Course Code : AIT006 Program

More information

Reducing Redundancy in Characteristic Rule Discovery by Using IP-Techniques

Reducing Redundancy in Characteristic Rule Discovery by Using IP-Techniques Reducing Redundancy in Characteristic Rule Discovery by Using IP-Techniques Tom Brijs, Koen Vanhoof and Geert Wets Limburg University Centre, Faculty of Applied Economic Sciences, B-3590 Diepenbeek, Belgium

More information

9. Conclusions. 9.1 Definition KDD

9. Conclusions. 9.1 Definition KDD 9. Conclusions Contents of this Chapter 9.1 Course review 9.2 State-of-the-art in KDD 9.3 KDD challenges SFU, CMPT 740, 03-3, Martin Ester 419 9.1 Definition KDD [Fayyad, Piatetsky-Shapiro & Smyth 96]

More information

Dynamic Data in terms of Data Mining Streams

Dynamic Data in terms of Data Mining Streams International Journal of Computer Science and Software Engineering Volume 1, Number 1 (2015), pp. 25-31 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

More information

Dta Mining and Data Warehousing

Dta Mining and Data Warehousing CSCI645 Fall 23 Dta Mining and Data Warehousing Instructor: Qigang Gao, Office: CS219, Tel:494-3356, Email: qggao@cs.dal.ca Teaching Assistant: Christopher Jordan, Email: cjordan@cs.dal.ca Office Hours:

More information

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei Data Mining Chapter 1: Introduction Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei 1 Any Question? Just Ask 3 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional

More information

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University CS377: Database Systems Data Warehouse and Data Mining Li Xiong Department of Mathematics and Computer Science Emory University 1 1960s: Evolution of Database Technology Data collection, database creation,

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

Association Rules Mining:References

Association Rules Mining:References Association Rules Mining:References Zhou Shuigeng March 26, 2006 AR Mining References 1 References: Frequent-pattern Mining Methods R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm

More information

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44 Data Mining Piotr Paszek piotr.paszek@us.edu.pl Introduction (Piotr Paszek) Data Mining DM KDD 1 / 44 Plan of the lecture 1 Data Mining (DM) 2 Knowledge Discovery in Databases (KDD) 3 CRISP-DM 4 DM software

More information

Post-Processing of Discovered Association Rules Using Ontologies

Post-Processing of Discovered Association Rules Using Ontologies Post-Processing of Discovered Association Rules Using Ontologies Claudia Marinica, Fabrice Guillet and Henri Briand LINA Ecole polytechnique de l'université de Nantes, France {claudia.marinica, fabrice.guillet,

More information

Sponsored by AIAT.or.th and KINDML, SIIT

Sponsored by AIAT.or.th and KINDML, SIIT CC: BY NC ND Table of Contents Chapter 2. Data Preprocessing... 31 2.1. Basic Representation for Data: Database Viewpoint... 31 2.2. Data Preprocessing in the Database Point of View... 33 2.3. Data Cleaning...

More information

Data Preprocessing. Chapter Why Preprocess the Data?

Data Preprocessing. Chapter Why Preprocess the Data? Contents 2 Data Preprocessing 3 2.1 Why Preprocess the Data?........................................ 3 2.2 Descriptive Data Summarization..................................... 6 2.2.1 Measuring the Central

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler, Sanjay Ranka

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler, Sanjay Ranka BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler, Sanjay Ranka Topics What is data? Definitions, terminology Types of data and datasets Data preprocessing Data Cleaning Data integration

More information

Keywords Fuzzy, Set Theory, KDD, Data Base, Transformed Database.

Keywords Fuzzy, Set Theory, KDD, Data Base, Transformed Database. Volume 6, Issue 5, May 016 ISSN: 77 18X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Fuzzy Logic in Online

More information

Reduce convention for Large Data Base Using Mathematical Progression

Reduce convention for Large Data Base Using Mathematical Progression Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 4 (2016), pp. 3577-3584 Research India Publications http://www.ripublication.com/gjpam.htm Reduce convention for Large Data

More information

Discovery of Association Rules in Temporal Databases 1

Discovery of Association Rules in Temporal Databases 1 Discovery of Association Rules in Temporal Databases 1 Abdullah Uz Tansel 2 and Necip Fazil Ayan Department of Computer Engineering and Information Science Bilkent University 06533, Ankara, Turkey {atansel,

More information

Cse352 Artifficial Intelligence Short Review for Midterm. Professor Anita Wasilewska Computer Science Department Stony Brook University

Cse352 Artifficial Intelligence Short Review for Midterm. Professor Anita Wasilewska Computer Science Department Stony Brook University Cse352 Artifficial Intelligence Short Review for Midterm Professor Anita Wasilewska Computer Science Department Stony Brook University Midterm Midterm INCLUDES CLASSIFICATION CLASSIFOCATION by Decision

More information

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer Data Mining George Karypis Department of Computer Science Digital Technology Center University of Minnesota, Minneapolis, USA. http://www.cs.umn.edu/~karypis karypis@cs.umn.edu Overview Data-mining What

More information

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore Data Warehousing Data Mining (17MCA442) 1. GENERAL INFORMATION: PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore 560 100 Department of MCA COURSE INFORMATION SHEET Academic

More information

Data Preprocessing UE 141 Spring 2013

Data Preprocessing UE 141 Spring 2013 Data Preprocessing UE 141 Spring 2013 Jing Gao SUNY Buffalo 1 Outline Data Data Preprocessing Improve data quality Prepare data for analysis Exploring Data Statistics Visualization 2 Document Data Each

More information

Data warehouse and Data Mining

Data warehouse and Data Mining Data warehouse and Data Mining Lecture No. 14 Data Mining and its techniques Naeem A. Mahoto Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

Data Preprocessing. Data Mining: Concepts and Techniques. c 2012 Elsevier Inc. All rights reserved.

Data Preprocessing. Data Mining: Concepts and Techniques. c 2012 Elsevier Inc. All rights reserved. 3 Data Preprocessing Today s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SHRI ANGALAMMAN COLLEGE OF ENGINEERING & TECHNOLOGY (An ISO 9001:2008 Certified Institution) SIRUGANOOR,TRICHY-621105. DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year / Semester: IV/VII CS1011-DATA

More information

AT78 DATA MINING & WAREHOUSING JUN 2015

AT78 DATA MINING & WAREHOUSING JUN 2015 Q.2 a. Where is data mining is used? (4) Applications: 1) Military uses 2) Medical field 3) Business Intelligence 4) Intelligence agencies (security purpose in communication and other fields) 5) Data Retrieval

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 05(b) : 23/10/2012 Data Mining: Concepts and Techniques (3 rd ed.) Chapter

More information

Knowledge Modelling and Management. Part B (9)

Knowledge Modelling and Management. Part B (9) Knowledge Modelling and Management Part B (9) Yun-Heh Chen-Burger http://www.aiai.ed.ac.uk/~jessicac/project/kmm 1 A Brief Introduction to Business Intelligence 2 What is Business Intelligence? Business

More information