Chapter 3: Data Mining
Lecturer: Syed Khutubuddin Ahmed
Contact: khutub27@gmail.com

3.1 What is Data Mining?

Data mining is the process of automatically discovering useful information in large data repositories.

Why do we need data mining?

Conventional database systems provide users with query and reporting tools. To some extent these tools can answer questions such as "Where did the largest number of students come from last year?", but they cannot provide any insight into why it happened.

Taking the example of a university database system:
- The OLTP system can quickly answer a query such as how many students are enrolled in the university.
- The OLAP system, using a data warehouse, can show trends in student enrollment (e.g., how many students prefer BCA).
- Data mining can answer where the university should market.

3.2 Data Mining and Knowledge Discovery

Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information.

The input data can be stored in various formats (flat files, spreadsheets, or relational tables). The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis.

3.3 Motivating Challenges

Traditional data analysis techniques have often run into practical difficulties with new kinds of data sets. The challenges that motivated the development of data mining are:

1) Scalability: Data sets with sizes of terabytes or petabytes are becoming common. A data mining algorithm that handles such massive data must be scalable, which may require parallel and distributed algorithms.

2) High Dimensionality: Large data sets often contain hundreds or thousands of attributes, and computational complexity increases rapidly as the number of dimensions grows. Data mining algorithms therefore need to cope with high-dimensional data, whereas traditional data analysis techniques were developed for low-dimensional data.

3) Heterogeneous and Complex Data: Traditional analysis methods typically deal with data sets whose attributes are all of the same type. With businesses growing rapidly, data mining techniques are required to handle heterogeneous data.

4) Data Ownership and Distribution: The data is not always stored at one location; it may be scattered across different places and different organizations. This calls for distributed data mining techniques, whose main challenges are:
   1) How to reduce the amount of communication needed to perform the distributed computation
   2) How to combine the data mining results obtained from multiple sources
   3) How to address data security issues

5) Non-Traditional Analysis: The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. This process is extremely labor-intensive, which motivates automating the generation and evaluation of hypotheses.

3.4 The Origin of Data Mining

Data mining draws ideas from machine learning/AI, pattern recognition, statistics, and database systems. Traditional techniques may be unsuitable because of:
- The enormity of the data
- The high dimensionality of the data
- The heterogeneous, distributed nature of the data

3.5 Data Mining Tasks

Data mining tasks are generally divided into two major categories:

Predictive tasks: Use some variables to predict unknown or future values of other variables. Ex: by observing the behaviour of one variable we can estimate the value of another variable.
- The attribute to be predicted is called the target or dependent variable.
- The attributes used for making the prediction are called explanatory or independent variables.

Descriptive tasks: Here the objective is to derive patterns (correlations, anomalies, clusters, etc.) that summarize the relationships in the data. Descriptive tasks often need postprocessing to validate and explain the results.

[Figure: the four core data mining tasks: Cluster Analysis, Predictive Modeling, Association Analysis, Anomaly Detection]

Four of the core data mining tasks:
1) Predictive modeling
2) Association analysis
3) Cluster analysis
4) Anomaly detection

1) Predictive Modeling: the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks:
   1) Classification: used for discrete target variables. Ex: predicting whether a Web user will make a purchase at an online bookstore is a classification task, because the target variable is binary-valued.
   2) Regression: used for continuous target variables. Ex: forecasting the future price of a stock is a regression task, because price is a continuous-valued attribute.

2) Association Analysis: used to discover patterns that describe strongly associated features in the data. One useful application is finding groups of data items that have related functionality. The goal of association analysis is to extract the most interesting patterns in an efficient manner.
   Ex: market basket analysis. We may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk.

3) Cluster Analysis: groups closely related observations together. Clustering has been used, for example, to group sets of related customers.
   Ex: a collection of news articles. In the table below, the first three rows are about the economy and the last three rows are about the health sector. A good clustering algorithm should be able to identify these two clusters based on the similarity between the words that appear in the articles.

   Article  Words
   1        dollar:1, industry:4, country:2, loan:3, government:2
   2        machinery:2, labor:3, market:4, industry:2, work:3, country:1
   3        job:5, inflation:3, rise:2, jobless:2, market:3, country:2
   4        patient:4, symptoms:2, drug:3, health:2, clinic:2, doctor:2
   5        death:2, cancer:4, drug:3, public:4, health:4, director:1
   6        medical:2, cost:3, increase:2, patient:2, health:3, care:2

4) Anomaly Detection: the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers.
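
The news-article example under cluster analysis can be made concrete with a small sketch. It is not a clustering algorithm; it only computes the cosine similarity between the word-count vectors from the table, which is one common (assumed here, not prescribed by these notes) way to measure how alike two documents are. The function names `cosine_similarity` and `nearest_neighbor` are illustrative.

```python
import math

# Word counts copied from the six-article table above
# (articles 1-3: economy, articles 4-6: health).
articles = [
    {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "government": 2},
    {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2},
    {"patient": 4, "symptoms": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 4, "director": 1},
    {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 2},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def nearest_neighbor(index):
    """Index of the article most similar to articles[index]."""
    others = [j for j in range(len(articles)) if j != index]
    return max(others, key=lambda j: cosine_similarity(articles[index], articles[j]))


for i in range(len(articles)):
    print(f"article {i + 1} is most similar to article {nearest_neighbor(i) + 1}")
```

Each article's most similar neighbour falls in its own topic group, and articles from different groups share no words at all, which is the signal a clustering algorithm would exploit.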
Applications of anomaly detection include fraud detection, network intrusion detection, unusual patterns of disease, and ecosystem disturbances.

3.6 Data

What is data? A collection of data objects and their attributes.

What is an attribute? An attribute is a property or characteristic of an object.
- Examples: eye color of a person, temperature, etc.
- An attribute is also known as a variable, field, characteristic, or feature.

What is an object? A collection of attributes describes an object.
- An object is also known as a record, point, case, sample, entity, or instance.

3.7 Types of Data

1) Attributes and Measures

Attribute values are numbers or symbols assigned to an attribute.

Distinction between attributes and attribute values:
- The same attribute can be mapped to different attribute values. Example: height can be measured in feet or meters.
- Different attributes can be mapped to the same set of values. Example: attribute values for ID and age are both integers.
Note: the properties of attribute values can differ. You can compute the average age of a group of people, but an average ID is meaningless.

Types of Attributes:
- Nominal (values are only distinct names). Examples: ID numbers, eye color, zip codes.
- Ordinal (values can be ordered). Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}.
- Interval (differences between values are meaningful). Examples: calendar dates, temperatures in Celsius or Fahrenheit.
- Ratio (both differences and ratios are meaningful). Examples: temperature in Kelvin, length, time, counts.

Properties of Attribute Values:

The type of an attribute depends on which of the following properties it possesses:
- Distinctness: = ≠
- Order: < >
- Addition: + -
- Multiplication: * /

- Nominal attribute: uses only distinctness
- Ordinal attribute: uses distinctness and order
- Interval attribute: uses distinctness, order, and addition
- Ratio attribute: uses all four properties

Describing Attributes by the Nature of Their Values:

Discrete attribute (integers):
- Has only a finite or countably infinite set of values.
- Examples: zip codes, ID numbers, or the set of words in a collection of documents.
- Often represented as integer variables.
- Note: binary attributes are a special case of discrete attributes.

Continuous attribute (floating point):
- Has real numbers as attribute values.
- Examples: temperature, height, or weight.
- In practice, real values can only be measured and represented using a finite number of digits.
- Continuous attributes are typically represented as floating-point variables.

3.7 Types of Data Sets

Types of data sets fall into three groups:
1) Record data
   - Transaction or market basket data
   - Data matrix
   - Document data or sparse data matrix
2) Graph data
   - Data with relationships among objects (World Wide Web)
   - Data with objects that are graphs (molecular structures)
3) Ordered data
   - Sequential data
   - Sequence data or genetic sequence data
   - Time series or temporal data
   - Spatial data

Three characteristics that apply to many data sets are:

i) Dimensionality: The dimensionality of a data set is the number of attributes that the objects in the data set possess. Data with a small number of dimensions tends to be qualitatively different from moderate- or high-dimensional data. The difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality.

ii) Sparsity: For data with asymmetric features, most of the attribute values of an object are zeros. In practical terms, sparsity is an advantage because usually only the non-zero values need to be stored and manipulated. This results in significant savings in computation time and storage. Some data mining algorithms work well only for sparse data.

iii) Resolution: It is frequently possible to obtain data at different levels of resolution, and the properties of the data are often different at different resolutions.
   Ex: the surface of the earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers.
   Ex: photo pixels: the higher the pixel resolution, the clearer the image; the lower the resolution, the blurrier the image.

Detailed view of the three types of data:

1) Record data: A record data set is a collection of data objects (records), each of which consists of a fixed set of data fields (attributes). Record data is usually stored in flat files or relational tables.
Types of record data are:
- Transaction or market basket data
- The data matrix
- The sparse data matrix

A) Transaction or Market Basket Data: Transaction data is a special type of record data in which each record (transaction) involves a set of items.

B) The Data Matrix: If all the data objects have the same fixed set of numeric attributes, the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute. A set of such data objects can be interpreted as an M by N matrix, with M rows (one per object) and N columns (one per attribute).

C) The Sparse Data Matrix: A sparse data matrix is a data matrix in which most entries are zero, so only the non-zero values need to be stored. Ex: fig. (d), a document-term matrix.
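
To illustrate the data matrix and its sparse form, here is a minimal sketch. The document-term counts are made up for illustration, and the dictionary-per-row layout is just one simple way to keep only non-zero entries; real libraries use more compact formats.

```python
def to_sparse(dense_rows):
    """Keep only the non-zero entries of each row as {column_index: value}."""
    return [{j: v for j, v in enumerate(row) if v != 0} for row in dense_rows]


def to_dense(sparse_rows, n_columns):
    """Rebuild the full M by N matrix, filling missing entries with zero."""
    return [[row.get(j, 0) for j in range(n_columns)] for row in sparse_rows]


# Hypothetical document-term matrix: 3 documents (rows) by 6 terms (columns).
dense = [
    [3, 0, 0, 0, 1, 0],
    [0, 0, 2, 0, 0, 0],
    [0, 0, 0, 0, 0, 4],
]

sparse = to_sparse(dense)
stored = sum(len(row) for row in sparse)   # number of values actually kept
total = sum(len(row) for row in dense)     # number of cells in the full matrix

print(sparse)
print(f"stored {stored} of {total} values")
```

The saving grows with the number of terms: a real document collection may have tens of thousands of columns, while each document uses only a small fraction of them.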

2) Graph-Based Data: A graph can sometimes be a convenient and powerful representation for data.

A. Data with relationships among objects: The relationships among objects frequently convey important information. Ex: web pages linked to one another.

B. Data with objects that are graphs: If objects have structure, that is, the objects contain sub-objects that have relationships, then such objects are frequently represented as graphs. Ex: a benzene molecule.

3) Ordered Data:

a. Sequential Data: Sequential data, also referred to as temporal data, can be thought of as an extension of record data in which each record has a time associated with it.

b. Sequence Data or Genetic Sequence Data: Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters.

c. Time Series Data: Time series data is a special type of sequential data in which each record is a time series, i.e., a series of measurements taken over time.

d. Spatial Data: Some objects have spatial attributes, such as positions or areas, as well as other types of attributes. An example of spatial data is weather data collected for a variety of geographical locations.

3.8 Data Processing

The different data processing techniques are:
1. Aggregation
2. Sampling
3. Dimensionality reduction
4. Feature creation
5. Discretization and binarization
6. Variable transformation

1. Aggregation: Combining two or more attributes (or objects) into a single attribute (or object).
Purpose:
- Data reduction: reduce the number of attributes or objects.
- Change of scale: cities aggregated into regions, states, countries, etc.
- More stable data: aggregated data tends to have less variability.

2. Sampling: Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis.
- Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
- Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
Analogy: to see whether rice is cooked, we check a single grain rather than every grain in the pot.

Types of Sampling:
- Simple random sampling: there is an equal probability of selecting any particular item. There are two variations on random sampling:
  1) Sampling without replacement: as each item is selected, it is removed from the population.
  2) Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once.
- Stratified sampling: split the data into several partitions, then draw random samples from each partition.
- Progressive sampling: if the proper sample size is difficult to determine, adaptive or progressive sampling is used. These approaches start with a small sample and then increase the sample size until a sample of sufficient size has been obtained.

3. Dimensionality Reduction: The complexity of the data increases as the number of dimensions grows.

Curse of Dimensionality: When dimensionality increases, data becomes increasingly sparse in the space that it occupies. The definitions of density and of distance between points, which are critical for clustering and outlier detection, become less meaningful. As a result, many clustering and classification algorithms have trouble with high-dimensional data, which leads to reduced classification accuracy and poor-quality clusters.

Purpose:
- Avoid the curse of dimensionality
- Reduce the amount of time and memory required by data mining algorithms
- Allow data to be more easily visualized
- May help to eliminate irrelevant features or reduce noise

Linear algebra techniques for dimensionality reduction:
- Principal Component Analysis
- Singular Value Decomposition
- Others: supervised and non-linear techniques
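
The claim that distances become less meaningful in high dimensions can be observed with a small experiment. The sketch below is an illustration under assumptions not stated in these notes: points are drawn uniformly from the unit hypercube, and "meaningfulness" is measured by the relative contrast (farthest - nearest) / nearest between a query point and a random sample. The function name `relative_contrast` and the chosen dimensions are hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point
    to n_points random points in the dim-dimensional unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # Euclidean distance
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


# Compare a low-dimensional space with increasingly high-dimensional ones.
contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, contrast in contrasts.items():
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
```

When the contrast is close to zero, the nearest and farthest points are almost equally far away, so distance-based notions such as "nearest neighbour" or "dense region" carry little information. This is one motivation for reducing dimensionality before clustering.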

Feature Subset Selection: Another way to reduce the dimensionality of data is to use only a subset of the features. It might seem that such an approach would lose information, but this is not the case if redundant and irrelevant features are present.
- Redundant features duplicate much or all of the information contained in one or more other attributes. Example: the purchase price of a product and the amount of sales tax paid.
- Irrelevant features contain no information that is useful for the data mining task at hand. Example: a student's ID is often irrelevant to the task of predicting the student's grade point average.

Techniques for feature selection:
- Brute-force approach: try all possible feature subsets as input to the data mining algorithm.
- Embedded approaches: feature selection occurs naturally as part of the data mining algorithm.
- Filter approaches: features are selected before the data mining algorithm is run.
- Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes.

4. Feature Creation: Create new attributes that can capture the important information in a data set much more efficiently than the original attributes.
Three general methodologies:
- Feature extraction: the creation of a new set of features from the original raw data.
- Mapping data to a new space: a totally different view of the data that can reveal important and interesting features.
- Feature construction: combining features to get better features than the originals.

5. Discretization and Binarization: Some data mining algorithms need the data to be in the form of categorical attributes, and algorithms that find association patterns require the data to be in the form of binary attributes. Transforming a continuous attribute into a categorical attribute is called discretization. Transforming continuous and discrete attributes into binary attributes is called binarization.

6. Variable Transformation: A variable transformation is a transformation that is applied to all the values of a variable (attribute). In other words, for each object, the transformation is applied to the value of the variable for that object. Ex: replacing each value by its absolute value when only the magnitude matters.

There are two types of variable transformation:
- Simple function: a simple mathematical function is applied to each value individually. If x is a variable, examples of such transformations include x^k, log x, e^x, 1/x, sin x, or |x|.
- Normalization: the goal of normalization or standardization is to make an entire set of values have a particular property. A common example is standardizing a variable by subtracting its mean and dividing by its standard deviation, so that the resulting values have mean 0 and standard deviation 1.
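
Items 5 and 6 above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: it assumes equal-width bins for discretization, one binary attribute per category for binarization, and mean/standard-deviation standardization for normalization. The function names and the height values are made up.

```python
import statistics


def equal_width_bins(values, n_bins):
    """Discretization: map each continuous value to a bin index 0..n_bins-1.
    Assumes the values are not all identical (bin width must be non-zero)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    # The maximum value lands on the upper edge, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def one_hot(categories, n_categories):
    """Binarization: one binary attribute per category."""
    return [[1 if c == k else 0 for k in range(n_categories)] for c in categories]


def standardize(values):
    """Normalization: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical data

bins = equal_width_bins(heights_cm, 3)   # continuous -> categorical
binary = one_hot(bins, 3)                # categorical -> binary attributes
z_scores = standardize(heights_cm)       # mean 0, standard deviation 1

print(bins)
print(binary)
print([round(z, 3) for z in z_scores])
```

Equal-width binning is the simplest choice; equal-frequency bins or supervised discretization may suit skewed data better.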

3.9 Measures of Similarity and Dissimilarity

Basics:

Similarity:
- A numerical measure of how alike two data objects are.
- Is higher when objects are more alike.
- Often falls in the range [0, 1].

Dissimilarity:
- A numerical measure of how different two data objects are.
- Is lower when objects are more alike.
- The minimum dissimilarity is often 0.
- The upper limit varies.

Similarity and Dissimilarity between Simple Attributes:

Let p and q be the attribute values for two data objects.

With respect to ordinal attributes: consider an attribute that measures the quality of a product, e.g., a candy bar, on the scale {poor, fair, OK, good, wonderful}. It would seem reasonable that a product P1, which is rated wonderful, would be closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2, good=3, wonderful=4}.
Then d(P1, P2) = 4 - 3 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (4 - 3) / (5 - 1) = 0.25. A similarity for ordinal attributes can then be defined as s = 1 - d.

Possible Questions from This Chapter:
1. What is data mining and why do we need data mining? (see 3.1)
2. Write a note on data mining and knowledge discovery. (see 3.2)
3. Explain the challenges that motivated the use of data mining. (see 3.3)
4. Explain the data mining tasks in detail. (see 3.5)
5. Write a note on data, attribute, and object. (see 3.6)
6. Explain in detail the types of attributes. (see 3.7)
7. Explain the different types of data sets in detail with proper examples and figures. (see 3.7)
8. Explain data processing in detail with examples. (see 3.8)
9. Write a note on measures of similarity and dissimilarity. (see 3.9)
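
As a worked illustration of the ordinal dissimilarity from section 3.9, the sketch below maps the candy-bar scale to successive integers starting at 0 and divides the rank difference by the number of levels minus one. The names `SCALE`, `RANK`, `ordinal_dissimilarity`, and `ordinal_similarity` are illustrative.

```python
# Ordinal scale from section 3.9, in increasing order of quality.
SCALE = ["poor", "fair", "OK", "good", "wonderful"]
RANK = {label: i for i, label in enumerate(SCALE)}  # poor=0 ... wonderful=4


def ordinal_dissimilarity(p, q):
    """d = |rank(p) - rank(q)| / (n - 1), which falls between 0 and 1."""
    return abs(RANK[p] - RANK[q]) / (len(SCALE) - 1)


def ordinal_similarity(p, q):
    """s = 1 - d."""
    return 1 - ordinal_dissimilarity(p, q)


print(ordinal_dissimilarity("wonderful", "good"))  # P1 vs P2
print(ordinal_dissimilarity("wonderful", "OK"))    # P1 vs P3
print(ordinal_similarity("wonderful", "good"))
```

P1 (wonderful) comes out closer to P2 (good) than to P3 (OK), matching the intuition in the text. Note that this mapping assumes equal spacing between adjacent levels, which an ordinal scale does not guarantee.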
More informationKnowledge Discovery in Data Bases
Knowledge Discovery in Data Bases Chien-Chung Chan Department of CS University of Akron Akron, OH 44325-4003 2/24/99 1 Why KDD? We are drowning in information, but starving for knowledge John Naisbett
More informationMachine Learning - Clustering. CS102 Fall 2017
Machine Learning - Fall 2017 Big Data Tools and Techniques Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions ( queries ) Data Mining Looking for