Chapter 3: Data Mining



3.1 What is Data Mining?

Data mining is the process of automatically discovering useful information in large data repositories.

Why do we need data mining?

Conventional database systems provide users with query and reporting tools. To some extent, these tools can help answer questions such as "Where did the largest number of students come from last year?" However, they cannot provide any insight into why it happened.

Example of a university database system:
o The OLTP system can quickly answer a query such as "How many students are enrolled in the university?"
o The OLAP system, using a data warehouse, can show trends in student enrollment (ex: how many students prefer BCA).
o Data mining can answer where the university should market.

3.2 Data Mining and Knowledge Discovery:

Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information.

The input data can be stored in various formats (flat files, spreadsheets, or relational tables). The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis.

Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 1

3.3 Motivating Challenges:

Traditional data analysis techniques have often encountered practical difficulties when applied to new kinds of data sets. The challenges that motivated the development of data mining are:

1) Scalability: Data sets with sizes of terabytes or petabytes are becoming common. A data mining algorithm that handles such massive data must be scalable, which may require parallel and distributed algorithms.

2) High Dimensionality: Large data sets often contain hundreds or thousands of attributes, and computational complexity increases rapidly as the number of dimensions grows. Traditional data analysis techniques were developed for low-dimensional data, so data mining algorithms must be able to handle data with many dimensions.

3) Heterogeneous and Complex Data: Traditional analysis methods usually deal with data sets whose attributes are all of the same type. As businesses grow rapidly, data mining techniques are required to deal with heterogeneous and more complex data.

4) Data Ownership and Distribution: Data is not always stored at one location; it may be scattered across different places and owned by different organizations. This requires distributed data mining techniques, which face the following challenges:
   1) How to reduce the amount of communication needed to perform the distributed computation
   2) How to combine the data mining results obtained from multiple sources
   3) How to address data security issues

5) Non-Traditional Analysis: The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. This process is extremely labor-intensive, which motivates techniques that automate the generation and evaluation of hypotheses.

Page 2


3.4 The Origin of Data Mining:

Data mining draws ideas from machine learning/AI, pattern recognition, statistics, and database systems. Traditional techniques may be unsuitable because of:
o Enormity of data
o High dimensionality of data
o Heterogeneous, distributed nature of data

3.5 Data Mining Tasks:

Data mining tasks are generally divided into two major categories:

Predictive tasks: Use some variables to predict unknown or future values of other variables. Ex: by observing the behaviour of one variable, we can estimate the value of another variable.
o The attribute to be predicted is called the target or dependent variable.
o The attributes used for making the prediction are called explanatory or independent variables.

Descriptive tasks: Here the objective is to derive patterns (correlations, anomalies, clusters, etc.) that summarize the relationships in the data. Descriptive tasks frequently require postprocessing techniques to validate and explain the results.

[Figure: the four core data mining tasks: Cluster Analysis, Predictive Modeling, Association Analysis, Anomaly Detection]

Page 3

Four of the core data mining tasks:
1) Predictive modeling
2) Association analysis
3) Cluster analysis
4) Anomaly detection

1) Predictive Modeling: refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks:
   1) Classification: used for discrete target variables. Ex: predicting whether a Web user will make a purchase at an online bookstore is a classification task, because the target variable is binary valued.
   2) Regression: used for continuous target variables. Ex: forecasting the future price of a stock is a regression task, because price is a continuous-valued attribute.

2) Association Analysis: used to discover patterns that describe strongly associated features in the data; a useful application is finding groups of data items that have related functionality. The goal of association analysis is to extract the most interesting patterns in an efficient manner. Ex: market basket analysis: we may discover the rule {Diapers} -> {Milk}, which suggests that customers who buy diapers also tend to buy milk.

3) Cluster Analysis: seeks to find groups of closely related observations; for example, clustering has been used to group sets of related customers. Ex: in the collection of news articles in the table below, the first three rows are about the economy and the last three rows are about the health sector. A good clustering algorithm should be able to identify these two clusters based on the similarity between the words that appear in the articles.

   Article   Words
   1         Dollar:1, industry:4, country:2, loan:3, government:2
   2         Machinery:2, labor:3, market:4, industry:2, work:3, country:1
   3         Job:5, inflation:3, rise:2, jobless:2, market:3, country:2
   4         Patient:4, symptoms:2, drug:3, health:2, clinic:2, doctor:2
   5         Death:2, cancer:4, drug:3, public:4, health:4, director:1
   6         Medical:2, cost:3, increase:2, patient:2, health:3, care:2

4) Anomaly Detection: is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers.
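As a rough illustration of the news-article example under Cluster Analysis, the sketch below measures how similar two articles are from their word counts. The notes do not name a similarity measure, so cosine similarity is an assumption made here for illustration; the word counts are the ones listed in the example (lower-cased), and the function names are hypothetical.

```python
import math

# Word counts taken from the six-article example (keys lower-cased).
articles = {
    1: {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "government": 2},
    2: {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    3: {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2},
    4: {"patient": 4, "symptoms": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    5: {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 4, "director": 1},
    6: {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 2},
}


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors stored as dicts."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def most_similar(article_id):
    """Return the id of the other article that is most similar to article_id."""
    others = (i for i in articles if i != article_id)
    return max(others, key=lambda i: cosine_similarity(articles[article_id], articles[i]))


for article_id in articles:
    print(article_id, "is closest to", most_similar(article_id))
```

With this measure, each economy article is closest to another economy article and each health article to another health article, which is the grouping a clustering algorithm would be expected to recover.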
Page 4

Applications of anomaly detection include fraud detection, network intrusion detection, unusual patterns of disease, and ecosystem disturbances.

3.6 Data:

What is data? A collection of data objects and their attributes.

What is an attribute? An attribute is a property or characteristic of an object.
o Examples: eye color of a person, temperature, etc.
o An attribute is also known as a variable, field, characteristic, or feature.

What is an object? A collection of attributes describes an object.
o An object is also known as a record, point, case, sample, entity, or instance.

3.7 Types of Data

1) Attributes and Measurement:

Attribute values are numbers or symbols assigned to an attribute.

Distinction between attributes and attribute values:
o The same attribute can be mapped to different attribute values. Example: height can be measured in feet or meters.
o Different attributes can be mapped to the same set of values. Example: attribute values for ID and age are integers.
Note: the properties of attribute values can differ. For example, it is meaningful to compute the average age of a group of persons, but not the average of their IDs.

Types of Attributes:
o Nominal (values are only names that distinguish one object from another). Examples: ID numbers, eye color, zip codes
o Ordinal (values can be placed in a meaningful order). Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
o Interval (differences between values are meaningful). Examples: calendar dates, temperatures in Celsius or Fahrenheit

Page 5

o Ratio (both differences and ratios of values are meaningful). Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values:

The type of an attribute depends on which of the following properties it possesses:
o Distinctness: =, ≠
o Order: <, >
o Addition: +, -
o Multiplication: *, /

o Nominal attribute: uses only distinctness
o Ordinal attribute: uses distinctness and order
o Interval attribute: uses distinctness, order, and addition
o Ratio attribute: uses all four properties

Describing Attributes by the Nature of Values:

Discrete Attribute (integers)
o Has only a finite or countably infinite set of values
o Examples: zip codes, ID numbers, or the set of words in a collection of documents
o Often represented as integer variables
o Note: binary attributes are a special case of discrete attributes

Continuous Attribute (floating point)
o Has real numbers as attribute values
o Examples: temperature, height, or weight
o In practice, real values can only be measured and represented using a finite number of digits
o Typically represented as floating-point variables

3.7 Types of Data Sets

Data sets are grouped into three types:
1) Record data
   o Transaction or market basket data
   o Data matrix
   o Document data or sparse data matrix
2) Graph data
   o Data with relationships among objects (World Wide Web)
   o Data with objects that are graphs (molecular structures)
3) Ordered data
   o Sequential data

Page 6

   o Sequence data or genetic sequence data
   o Time series or temporal data
   o Spatial data

Three characteristics that apply to many data sets are:

i) Dimensionality: The dimensionality of a data set is the number of attributes that the objects in the data set possess. Data with a small number of dimensions tends to be qualitatively different from moderate- or high-dimensional data. The difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality.

ii) Sparsity: For some data sets, such as those with asymmetric features, most attribute values of an object are zero. In practical terms, sparsity is an advantage because usually only the non-zero values need to be stored and manipulated. This results in significant savings in computation time and storage. Some data mining algorithms work well for sparse data.

iii) Resolution: It is frequently possible to obtain data at different levels of resolution, and often the properties of the data are different at different resolutions. Ex: the surface of the earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers. Ex: photo pixels (the higher the pixel resolution, the clearer the image; the lower the resolution, the more blurred the image).

Detailed view of the three types of data:

1) Record data: A record data set is a collection of data objects (records), each of which consists of a fixed set of data fields (attributes). Record data is usually stored in flat files or relational tables.

Types of record data are:
o Transaction or market basket data
o The data matrix
o The sparse data matrix

Page 7

A) Transaction or Market Basket Data: Transaction data is a special type of record data, where each record (transaction) involves a set of items.

B) The Data Matrix: If all the data objects in a collection have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute. A set of such data objects can be interpreted as an M by N matrix, with M rows (one per object) and N columns (one per attribute).

C) The Sparse Data Matrix: A sparse data matrix is a special case of a data matrix in which most entries are zero, so only the non-zero values are important. Ex: fig (d), a document-term matrix.

Page 8
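To make the storage saving of a sparse data matrix concrete, here is a minimal sketch that keeps only the non-zero entries of a small document-term matrix. The term list and counts are hypothetical illustration data, and the dictionary-of-coordinates layout is just one simple way to store sparse data, not a format prescribed by these notes.

```python
# Hypothetical document-term counts: one row per document, one column per term.
terms = ["team", "coach", "play", "ball", "score", "game"]
dense = [
    [3, 0, 5, 0, 2, 6],
    [0, 7, 0, 2, 1, 0],
    [0, 1, 0, 0, 1, 2],
]


def to_sparse(rows):
    """Keep only the non-zero entries, keyed by (row index, column index)."""
    return {
        (i, j): value
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if value != 0
    }


def to_dense(sparse, n_rows, n_cols):
    """Rebuild the full matrix; every entry that was not stored is zero."""
    rows = [[0] * n_cols for _ in range(n_rows)]
    for (i, j), value in sparse.items():
        rows[i][j] = value
    return rows


sparse = to_sparse(dense)
print(len(sparse), "stored values instead of", len(dense) * len(terms))
```

The saving grows with the proportion of zeros, which is why document-term matrices over a large vocabulary are usually stored this way.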

2) Graph-Based Data: A graph can sometimes be a convenient and powerful representation for data.

A. Data with relationships among objects: The relationships among objects frequently convey important information. Ex: web pages linked to one another.

B. Data with objects that are graphs: If objects have structure, that is, the objects contain sub-objects that have relationships, then such objects are frequently represented as graphs. Ex: benzene molecule.

3) Ordered Data:

a. Sequential Data: Sequential data, also referred to as temporal data, can be thought of as an extension of record data in which each record has a time associated with it.

b. Sequence Data or Genetic Sequence Data: Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters.

Page 9


c. Time Series Data: Time series data is a special type of sequential data in which each record is a time series, i.e. a series of measurements taken over time.

d. Spatial Data: Some objects have spatial attributes, such as positions or areas, as well as other types of attributes. An example of spatial data is weather data that is collected for a variety of geographical locations.

Page 10

3.8 Data Processing

Different data processing techniques are:
1. Aggregation
2. Sampling
3. Dimensionality reduction
4. Feature creation
5. Discretization and binarization
6. Variable transformation

1. Aggregation: Combining two or more attributes (or objects) into a single attribute (or object).

Purpose:
o Data reduction: reduce the number of attributes or objects
o Change of scale: cities aggregated into regions, states, countries, etc.
o More stable data: aggregated data tends to have less variability

2. Sampling: Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis.

Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming. Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

ANALOGY: to see whether rice is cooked, we check only a few grains, not every grain in the pot.

Page 11
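The change-of-scale purpose of aggregation (cities aggregated into regions) can be sketched in a few lines. The city names, regions, and amounts below are hypothetical illustration data, and summing is only one possible aggregation function.

```python
from collections import defaultdict

# Hypothetical city-level sales records: (city, region, amount).
sales = [
    ("City A", "North", 120.0),
    ("City B", "North", 80.0),
    ("City C", "South", 200.0),
    ("City D", "South", 50.0),
    ("City E", "South", 25.0),
]


def aggregate_by_region(records):
    """Combine city-level objects into one object per region by summing amounts."""
    totals = defaultdict(float)
    for _city, region, amount in records:
        totals[region] += amount
    return dict(totals)


region_totals = aggregate_by_region(sales)
print(len(sales), "city records reduced to", len(region_totals), "region records")
print(region_totals)
```

Five objects become two, which illustrates both data reduction and the change of scale from cities to regions.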

Types of Sampling:

Simple Random Sampling: There is an equal probability of selecting any particular item. There are two variations on random sampling:
   1) Sampling without replacement: as each item is selected, it is removed from the population.
   2) Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once.

Stratified Sampling: Split the data into several partitions, then draw random samples from each partition.

Progressive Sampling: If the proper sample size is difficult to determine, adaptive or progressive sampling is used. These approaches start with a small sample and then increase the sample size until a sample of sufficient size has been obtained.

3. Dimensionality Reduction: The complexity of analysis increases as the number of dimensions in the data grows.

Curse of Dimensionality: When dimensionality increases, data becomes increasingly sparse in the space that it occupies. The definitions of density and of distance between points, which are critical for clustering and outlier detection, become less meaningful. As a result, many clustering and classification algorithms have trouble with high-dimensional data, which leads to reduced classification accuracy and poor-quality clusters.

Purpose:
o Avoid the curse of dimensionality
o Reduce the amount of time and memory required by data mining algorithms
o Allow data to be more easily visualized
o May help to eliminate irrelevant features or reduce noise

Linear algebra techniques for dimensionality reduction:
o Principal Component Analysis (PCA)
o Singular Value Decomposition (SVD)
o Others: supervised and non-linear techniques

Page 12
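The sampling variations listed under Types of Sampling can be sketched with Python's standard `random` module. The function names, the `key` argument, and the per-stratum sample size are assumptions made for illustration; progressive sampling is not covered here.

```python
import random
from collections import defaultdict


def sample_without_replacement(items, k, rng):
    """Each selected item is removed from the population, so no item repeats."""
    return rng.sample(items, k)


def sample_with_replacement(items, k, rng):
    """Items stay in the population, so the same item can be picked again."""
    return rng.choices(items, k=k)


def stratified_sample(items, key, k_per_stratum, rng):
    """Partition the items by key(item), then sample randomly inside each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return sample


rng = random.Random(42)  # fixed seed so that a run is repeatable
population = list(range(100))
print(sample_without_replacement(population, 5, rng))
print(sample_with_replacement(population, 5, rng))

# Hypothetical student records: (student id, course).
students = [(i, "BCA" if i % 4 else "BBA") for i in range(40)]
print(stratified_sample(students, key=lambda s: s[1], k_per_stratum=3, rng=rng))
```

Stratified sampling is useful here because the smaller course would be underrepresented, or missed entirely, in a small simple random sample.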

Feature Subset Selection: Another way to reduce the dimensionality of data is to use only a subset of the features. It might seem that such an approach would lose information, but this is not the case if redundant and irrelevant features are present.
o Redundant features duplicate much or all of the information contained in one or more other attributes. Example: the purchase price of a product and the amount of sales tax paid.
o Irrelevant features contain no information that is useful for the data mining task at hand. Example: a student's ID is often irrelevant to the task of predicting the student's grade point average.

Techniques for feature selection:
o Brute-force approach: try all possible feature subsets as input to the data mining algorithm
o Embedded approaches: feature selection occurs naturally as part of the data mining algorithm
o Filter approaches: features are selected before the data mining algorithm is run
o Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes

4. Feature Creation: Create new attributes that can capture the important information in a data set much more efficiently than the original attributes.

Three general methodologies:
o Feature Extraction: the creation of a new set of features from the original raw data is known as feature extraction.
o Mapping Data to a New Space: a totally different view of the data that can reveal important and interesting features.

Page 13

o Feature Construction: combining features to get better features than the original ones.

5. Discretization and Binarization: Some data mining algorithms need the data to be in the form of categorical attributes, and algorithms that find association patterns require the data to be in the form of binary attributes. Transforming a continuous attribute into a categorical attribute is called discretization. Transforming continuous and discrete attributes into binary attributes is called binarization.

6. Variable Transformation: A variable transformation refers to a transformation that is applied to all the values of a variable (attribute). In other words, for each object, the transformation is applied to the value of the variable for that object. Ex: replacing each value with its absolute value.

There are two types of variable transformation:

Simple function: For this type of variable transformation, a simple mathematical function is applied to each value individually. If x is a variable, then examples of such transformations include x^k, log x, e^x, 1/x, sin x, or |x|.

Normalization: The goal of normalization or standardization is to make an entire set of values have a particular property. A common example is standardizing a variable by subtracting its mean and dividing by its standard deviation, so that the resulting values have a mean of 0 and a standard deviation of 1.

Page 14
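The three transformations described above (discretization, binarization, and normalization) can be sketched as follows. Equal-width binning and one-binary-attribute-per-category encoding are assumptions chosen for illustration, since the notes do not prescribe a specific method; the function names are hypothetical.

```python
import statistics


def equal_width_bins(values, n_bins):
    """Discretization: map each continuous value to a bin index 0..n_bins-1.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def binarize(labels, n_categories):
    """Binarization: one binary attribute per category."""
    return [[1 if label == c else 0 for c in range(n_categories)] for label in labels]


def standardize(values):
    """Normalization: subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


ages = [18, 22, 25, 31, 40, 58]  # hypothetical continuous attribute
bins = equal_width_bins(ages, 3)
print(bins)
print(binarize(bins, 3))
print(standardize(ages))
```

After standardization the values have mean 0 and standard deviation 1, which is the "particular property" the normalization paragraph refers to.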

3.9 Measures of Similarity and Dissimilarity:

Basics:

Similarity
o Numerical measure of how alike two data objects are
o Is higher when objects are more alike
o Often falls in the range [0,1]

Dissimilarity
o Numerical measure of how different two data objects are
o Is lower when objects are more alike
o Minimum dissimilarity is often 0
o Upper limit varies

Similarity and Dissimilarity between Simple Attributes:

Let p and q be the attribute values for two data objects.

With respect to ordinal attributes: Consider an attribute that measures the quality of a product, e.g. a candy bar, on the scale {poor, fair, OK, good, wonderful}. It would seem reasonable that a product P1, which is rated wonderful, would be closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2, good=3, wonderful=4}.

Page 15
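The integer mapping for the candy-bar scale can be sketched as below. Dividing by the number of levels minus one brings the dissimilarity into the range 0 to 1, and the similarity is taken as s = 1 - d; the function and variable names are hypothetical.

```python
# Ordinal scale from the candy-bar example, in increasing order of quality.
LEVELS = ["poor", "fair", "OK", "good", "wonderful"]
RANK = {level: i for i, level in enumerate(LEVELS)}  # poor=0, ..., wonderful=4


def ordinal_dissimilarity(p, q, normalize=True):
    """Absolute rank difference, optionally scaled to [0, 1] by (levels - 1)."""
    d = abs(RANK[p] - RANK[q])
    return d / (len(LEVELS) - 1) if normalize else d


def ordinal_similarity(p, q):
    """Similarity defined as s = 1 - d, using the normalized dissimilarity."""
    return 1 - ordinal_dissimilarity(p, q)


print(ordinal_dissimilarity("wonderful", "good", normalize=False))
print(ordinal_dissimilarity("wonderful", "good"))
print(ordinal_dissimilarity("wonderful", "OK"))
print(ordinal_similarity("wonderful", "good"))
```

As expected, the product rated wonderful is closer to the one rated good than to the one rated OK.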

With this mapping, d(P1, P2) = 3 - 2 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (3 - 2)/(5 - 1) = 0.25. A similarity for ordinal attributes can then be defined as s = 1 - d.

Possible Questions from This Chapter:
1. What is data mining and why do we need data mining? Ans: Page-1
2. Write a note on data mining and knowledge discovery. Ans: Page-1
3. Explain the challenges that motivated the use of data mining. Ans: Page-2
4. Explain the data mining tasks in detail. Ans: Page-3 to 4
5. Write a note on data, attribute, and object. Ans: Page-5
6. Explain in detail the types of attributes. Ans: Page-5 to 6
7. Explain the different types of data sets in detail with proper examples and figures. Ans: Page-6 to 10
8. Explain data processing in detail with examples. Ans: Page-11 to 14
9. Write a note on measures of similarity and dissimilarity. Ans: Page-15 to 16

Page 16
