Proximity and Data Pre-processing


Huiping Cao

Outline
- Types of data
- Data quality
- Measurement of proximity
- Data preprocessing

Similarity and Dissimilarity
- Dissimilarity (distance)
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0; the upper limit varies
- Similarity
  - Numerical measure of how alike two data objects are
  - Higher when objects are more alike
  - Often falls in the range [0, 1]
- Proximity refers to either a similarity or a dissimilarity

Similarity and Dissimilarity (cont.)
- Proximity can be defined for simple (single) attributes or combined over multiple attributes
- E.g., dense data: Euclidean distance
- E.g., sparse data: Jaccard and Cosine similarity

Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.

- Nominal:
  d = 0 if p = q, 1 if p ≠ q
  s = 1 if p = q, 0 if p ≠ q
- Ordinal (values mapped to integers 0 to n-1, where n is the number of values):
  d = |p - q| / (n - 1)
  s = 1 - |p - q| / (n - 1)
- Interval or Ratio:
  d = |p - q|
  s = -d, s = 1/(1 + d), or s = 1 - (d - min_d)/(max_d - min_d),
  where min_d and max_d are the minimum and maximum distances between every two values.
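
A minimal R sketch of the per-attribute formulas above (the function names are illustrative, not part of the original slides):

# per-attribute dissimilarity/similarity, following the table above
nominal_dissim  <- function(p, q) ifelse(p == q, 0, 1)
ordinal_dissim  <- function(p, q, n) abs(p - q) / (n - 1)  # values already mapped to 0..n-1
interval_dissim <- function(p, q) abs(p - q)
nominal_sim  <- function(p, q) ifelse(p == q, 1, 0)
ordinal_sim  <- function(p, q, n) 1 - ordinal_dissim(p, q, n)
interval_sim <- function(d) 1 / (1 + d)                    # one of the listed transformations
ordinal_dissim(0, 2, 3)                  # two ends of a 3-valued ordinal scale -> 1
interval_sim(interval_dissim(10, 12))    # 1 / (1 + 2) = 0.333...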

Similarity and Dissimilarity for Multiple Attributes
- Dense/continuous data: Euclidean, Minkowski, DTW, etc.
- Binary/sparse data: SMC, Jaccard, Cosine, Hamming, etc.

Euclidean Distance

dist = sqrt( sum_{i=1}^{n} (q_i - q'_i)^2 )

- n is the number of dimensions (attributes)
- q_i and q'_i are the ith attributes (components) of data objects q and q'
- Standardization is necessary if scales differ

Euclidean Distance Example
(scatter plot of the four points omitted)

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance matrix:
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
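
The same distance matrix can be reproduced with R's built-in dist() (a small sketch; any matrix or data frame of coordinates works):

> points <- rbind(p1 = c(0, 2), p2 = c(2, 0), p3 = c(3, 1), p4 = c(5, 1))
> round(dist(points, method = "euclidean"), 3)
      p1    p2    p3
p2 2.828
p3 3.162 1.414
p4 5.099 3.162 2.000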

Manhattan Distance
Considered by Hermann Minkowski in 19th-century Germany.

dist = sum_{i=1}^{n} |q_i - q'_i|

- Also called taxicab metric/distance, rectilinear distance, Minkowski's L_1 norm/distance, city block distance, or Manhattan length
- Parameters
  - n is the number of dimensions (attributes)
  - q_i and q'_i are, respectively, the ith attributes (components) of data objects q and q'
- Applications
  - In chess, the distance between squares on the chessboard for rooks is measured in Manhattan distance
  - The length of the shortest path a taxi could take between two intersections equals the intersections' distance in taxicab geometry
  - Integrated circuits where wires only run parallel to the X or Y axis

Minkowski Distance
The Minkowski distance is a metric on Euclidean space which can be considered a generalization of both the Euclidean distance and the Manhattan distance.

dist = ( sum_{i=1}^{n} |q_i - q'_i|^p )^(1/p)

Parameters
- p is a parameter
- n is the number of dimensions (attributes)
- q_i and q'_i are, respectively, the ith attributes (components) of data objects q and q'

Minkowski Distance: Illustration
- p = 1: Manhattan distance
- p = 2: Euclidean distance
- p -> infinity: supremum, Chebyshev (L_max norm, L_inf norm) distance
  lim_{p -> inf} ( sum_{i=1}^{n} |q_i - q'_i|^p )^(1/p) = max_{i=1..n} |q_i - q'_i|
  Kings and queens use Chebyshev distance in chess
- L_min norm
  lim_{p -> -inf} ( sum_{i=1}^{n} |q_i - q'_i|^p )^(1/p) = min_{i=1..n} |q_i - q'_i|
- Note: do not confuse p with n, i.e., all these distances are defined for all numbers of dimensions

Minkowski Distance Example

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance matrices:

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L_inf  p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0
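
These three matrices can also be obtained directly with dist() (a sketch, reusing the point matrix from the Euclidean example):

> pts <- rbind(p1 = c(0, 2), p2 = c(2, 0), p3 = c(3, 1), p4 = c(5, 1))
> dist(pts, method = "manhattan")              # L1
> round(dist(pts, method = "euclidean"), 3)    # L2
> dist(pts, method = "maximum")                # L_inf (Chebyshev)
> dist(pts, method = "minkowski", p = 3)       # any other p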

Issues to Consider in Calculating Distance
- Attributes have different scales
- Attributes are correlated
- Objects are composed of different types of attributes (qualitative vs. quantitative)
- Attributes have different weights

Common Properties of a Distance
Distances, such as the Euclidean distance, have some well-known properties.
- Positivity: d(p, q) >= 0 for all p and q; d(p, q) = 0 only if p = q
- Symmetry: d(p, q) = d(q, p) for all p and q
- Triangle inequality: d(p, r) <= d(p, q) + d(q, r) for all points p, q, and r
A distance that satisfies these properties is a metric.

Elastic Distances
- Dynamic Time Warping (DTW)
- Edit-distance-based measures
  - Longest Common SubSequence (LCSS)
  - Edit Distance on Real Sequence (EDR)

Elastic Distances - DTW
- In time series analysis, Dynamic Time Warping (DTW) is an algorithm for measuring similarity between two temporal sequences which may vary in time or speed
- DTW is a technique to find an optimal alignment [1] between time series if one time series can be warped non-linearly by stretching or shrinking it
- Applications: similarities in walking patterns, video, audio, etc.
- Calculates an optimal match between two given sequences (e.g., time series) with certain restrictions

[1] M. Müller, Information Retrieval for Music and Motion, Springer, 2007, ISBN: 978-3-540-74047-6, http://www.springer.com/978-3-540-74047-6

Elastic Distances - DTW - Problem Definition
- Given a reference sequence s[1..n], a query sequence t[1..m], and a local distance measure dist(s[i], t[j])
- Goal: find an alignment between s and t having minimal overall cost
- An (n,m)-warping path (or simply a warping path if n and m are clear from the context) is a sequence p = (p_1, ..., p_L) with p_i = (n_i, m_i) in [1:n] x [1:m] for 1 <= i <= L satisfying the following three conditions:
  - Boundary condition: p_1 = (1,1) and p_L = (n,m)
  - Monotonicity condition: n_1 <= n_2 <= ... <= n_L and m_1 <= m_2 <= ... <= m_L
  - Step size condition: p_{i+1} - p_i in {(1,0), (0,1), (1,1)} for 1 <= i <= L-1

Elastic Distances - DTW - Distance
- The total cost of a warping path p = (p_1, ..., p_L) between s and t is
  cost_p(s, t) = sum_{i=1}^{L} dist(p_i) = sum_{i=1}^{L} dist(s[n_i], t[m_i]),
  where p_i = (n_i, m_i)
- Let p* be an optimal warping path between s and t
- The DTW distance DTW(s, t) between s and t is defined as the total cost of p*:
  DTW(s, t) = cost_{p*}(s, t) = min { cost_p(s, t) | p is an (n,m)-warping path }

Elastic Distances - DTW - Classical Algorithm
- Step 1: build a matrix D of size (n+1) x (m+1) by following a dynamic-programming recurrence (a sketch of this step follows)
- Step 2: backtrack through the matrix D to obtain the entire optimal DTW path
- Example: http://www.cs.nmsu.edu/~hcao/teaching/cs488508/note/3_dtw_eg.pdf
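
A minimal R sketch of Step 1 (the cost-matrix recurrence), assuming the local distance dist(s[i], t[j]) = |s[i] - t[j]|; backtracking is omitted:

dtw_cost <- function(s, t) {
  n <- length(s); m <- length(t)
  D <- matrix(Inf, n + 1, m + 1)   # extra row/column hold the Inf boundary
  D[1, 1] <- 0
  for (i in 1:n) {
    for (j in 1:m) {
      d <- abs(s[i] - t[j])        # local cost
      D[i + 1, j + 1] <- d + min(D[i, j], D[i + 1, j], D[i, j + 1])
    }
  }
  D[n + 1, m + 1]                  # DTW(s, t)
}
s <- c(1, 1, 2, 3, 2, 0); t <- c(0, 1, 1, 2, 3, 2, 1)
dtw_cost(s, t)   # 2; should match the dtw package example later in these notes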

Elastic Distances - DTW - Classical Algorithm
- Complexity
  - Time complexity: O(nm)
  - Space complexity: O(nm)
- Properties
  - The DTW distance is well defined even though there may be several warping paths of minimal total cost
  - The DTW distance is symmetric if the local cost measure dist is symmetric

Elastic Distances - DTW - Distance
- DTW calculation with bands: Hiroaki Sakoe and Seibi Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-26, No. 1, 1978
- More algorithms: http://www.cs.nmsu.edu/~hcao/teaching/cs488508/readings/dtw2_kdd2016.pdf

Common Properties of a Similarity
Similarities also have some well-known properties.
- Maximum similarity: s(p, q) = 1 only if p = q
- Symmetry: s(p, q) = s(q, p) for all p and q
where s(p, q) is the similarity between points (data objects) p and q.

Similarity Between Binary Vectors
Compute similarities using the following quantities:
- M_01 = the number of attributes where p was 0 and q was 1
- M_10 = the number of attributes where p was 1 and q was 0
- M_00 = the number of attributes where p was 0 and q was 0
- M_11 = the number of attributes where p was 1 and q was 1
The resulting similarity coefficients are in the range [0, 1]. (A small R sketch of these counts follows.)
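
A small R sketch (illustrative) that counts the four quantities for the two binary vectors used in the SMC and Jaccard examples below:

> p <- c(1,0,0,0,0,0,0,0,0,0)
> q <- c(0,0,0,0,0,0,1,0,0,1)
> M01 <- sum(p == 0 & q == 1)   # 2
> M10 <- sum(p == 1 & q == 0)   # 1
> M00 <- sum(p == 0 & q == 0)   # 7
> M11 <- sum(p == 1 & q == 1)   # 0
> (M11 + M00) / (M01 + M10 + M11 + M00)   # SMC = 0.7
> M11 / (M01 + M10 + M11)                 # Jaccard = 0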

Similarity and Distance Measures
- Simple Matching Coefficient (SMC)
- Jaccard Coefficient
- Cosine Similarity
- Hamming distance

Simple Matching Coefficient (SMC)

SMC = number of matches / number of attributes
    = (M_11 + M_00) / (M_01 + M_10 + M_11 + M_00)

Example:
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M_01 = 2, M_10 = 1, M_00 = 7, M_11 = 0
SMC = (0 + 7) / 10 = 0.7

Jaccard Coefficient

J = number of 11 matches / (number of attributes - number of 00 attribute values)
  = M_11 / (M_01 + M_10 + M_11)

Handles asymmetric binary attributes.

Example:
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M_01 = 2, M_10 = 1, M_00 = 7, M_11 = 0
J = 0 / (2 + 1 + 0) = 0

Cosine Similarity
If d1 and d2 are two document vectors, then

cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)

where . indicates the vector dot product and ||d|| is the length of vector d.

Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 1 1 0 2
d1 . d2 = 3 + 2 = 5
||d1|| = sqrt(42) = 6.481
||d2|| = sqrt(7) = 2.646
cos(d1, d2) = 5 / (6.481 * 2.646) = 0.2916

Hamming Distance
- Measures distance between vectors of equal length with nominal values: the number of positions at which the corresponding symbols are different
- Examples
  - "toned" and "roses": 3
  - 1011101 and 1001001: 2
  - 2173896 and 2233796: 3
- Named after Richard Hamming. It is used in telecommunication to count the number of flipped bits in a fixed-length binary word as an estimate of error, and therefore is sometimes called the signal distance
- Not suitable for comparing strings of different lengths, or strings where not just substitutions but also insertions or deletions have to be expected

Correlation
- Correlation measures the linear relationship between objects
- To compute correlation, we standardize the data objects p and q, and then take their dot product

corr(p, q) = covariance(p, q) / (std(p) * std(q)) = s_pq / (s_p * s_q)

covariance(p, q) = s_pq = 1/(n-1) * sum_{i=1}^{n} (p_i - p̄)(q_i - q̄)

std(p) = s_p = sqrt( 1/(n-1) * sum_{i=1}^{n} (p_i - p̄)^2 )
std(q) = s_q = sqrt( 1/(n-1) * sum_{i=1}^{n} (q_i - q̄)^2 )

p̄ = 1/n * sum_{i=1}^{n} p_i,  q̄ = 1/n * sum_{i=1}^{n} q_i

Correlation - Example

> p<-c(1,3,5)
> q<-c(2,3,5)
> cor(p,q,method="pearson")
[1] 0.9819805

OR

> mean(p);mean(q);sd(p);sd(q);cov(p,q)
[1] 3
[1] 3.333333
[1] 2
[1] 1.527525
[1] 3
> cov(p,q)/(sd(p)*sd(q))
[1] 0.9819805

Same as the result from the cor() function.

General Approach for Combining Similarities
- Situation: attributes are of different types, and an overall similarity is needed
- For the ith attribute, compute a similarity s_i in [0, 1]
- Define δ_i for the ith attribute:
  - δ_i = 0 (i.e., do not count this attribute) if the ith attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the ith attribute
  - δ_i = 1 otherwise
- Compute the overall similarity:
  similarity(p, q) = sum_{i=1}^{n} δ_i s_i / sum_{i=1}^{n} δ_i

Using Weights to Combine Similarities
- We may not want to treat all attributes the same
- Use weights w_i in [0, 1] with sum_{i=1}^{n} w_i = 1:
  similarity(p, q) = sum_{i=1}^{n} w_i δ_i s_i / sum_{i=1}^{n} δ_i
(An R sketch of both combination rules follows.)
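
An R sketch of the two combination rules above, assuming the per-attribute similarities s_i, the indicators δ_i, and the weights w_i are already available as vectors (the values below are illustrative):

> combine_sim <- function(s, delta) sum(delta * s) / sum(delta)
> combine_sim_weighted <- function(s, delta, w) sum(w * delta * s) / sum(delta)
> s <- c(0.8, 0.0, 0.5)   # per-attribute similarities
> delta <- c(1, 0, 1)     # attribute 2 skipped (asymmetric 0/0 or missing value)
> combine_sim(s, delta)                              # (0.8 + 0.5) / 2 = 0.65
> combine_sim_weighted(s, delta, c(0.5, 0.2, 0.3))   # 0.275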

R: Calculate SMC, Jaccard, Hamming

> p<-c(1,0,0,0,0,0,0,0,0,0)
> q<-c(0,0,0,0,0,0,1,0,0,1)

#SMC
> sum(p==q)/length(q)
[1] 0.7

#Jaccard
> pq <- p+q
> pq
 [1] 1 0 0 0 0 0 1 0 0 1
> J <- sum(pq==2)/sum(pq>=1)
> J
[1] 0

#Hamming
> sum(p!=q)
[1] 3

R: Calculate Cosine

> d1<-c(3,2,0,5,0,0,0,2,0,0)
> d2<-c(1,0,0,0,0,0,1,1,0,2)

#Cosine
> d1 %*% d2
     [,1]
[1,]    5
> sqrt(d1 %*% d1)
         [,1]
[1,] 6.480741
> sqrt(d2 %*% d2)
         [,1]
[1,] 2.645751
> (d1 %*% d2)/(sqrt(d1 %*% d1)*sqrt(d2 %*% d2))
          [,1]
[1,] 0.2916059

R: Cosine Package
cosine {lsa}  (??cosine)

> d1<-c(3,2,0,5,0,0,0,2,0,0)
> d2<-c(1,0,0,0,0,0,1,1,0,2)
> require(lsa)
> cosine(d1,d2)
          [,1]
[1,] 0.2916059

R: DTW
dtw {dtw}

> library("dtw")
> s<-c(1,1,2,3,2,0)
> t<-c(0,1,1,2,3,2,1)
> d<-dtw(t,s,dist.method="manhattan",keep.internals=TRUE, step.pattern=symmetric1)
> d$stepPattern
Step pattern recursion:
 g[i,j] = min(
     g[i-1,j-1] + d[i,j],
     g[i,j-1]   + d[i,j],
     g[i-1,j]   + d[i,j],
 )
> d$costMatrix
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    4    7    9    9
[2,]    1    1    2    4    5    6
[3,]    1    1    2    4    5    6
[4,]    2    2    1    2    2    4
[5,]    4    4    2    1    2    5
[6,]    5    5    2    2    1    3
[7,]    5    5    3    4    2    2

R: Hamming Distance

> install.packages("e1071")   # this only needs to be run once
> library("e1071")
> c1 <- c(1,0,1,1,1,0,1)
> c2 <- c(1,0,0,1,0,0,1)
> hamming.distance(c1,c2)
[1] 2
> c3 <- c(2,1,7,3,8,9,6)
> c4 <- c(2,2,3,3,7,9,6)
> hamming.distance(c3,c4)
[1] 0
Problematic! The inputs are not binary.

R: Correlation
cor {stats}

> help(cor)
> longley
     GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
1952         98.1 346.999      193.2        359.4    113.270 1952   63.639
...
> cor(longley$Armed.Forces,longley$Year,method="pearson")
[1] 0.4172451

Sampling
- Sampling is often used for both the preliminary investigation of the data and the final data analysis
- Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming
- Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming

Key Principle for Effective Sampling
- Using a sample will work almost as well as using the entire data set, if the sample is representative
- A sample is representative if it has approximately the same property (of interest) as the original set of data

Types of Sampling
- Simple random sampling: there is an equal probability of selecting any particular item
  - Variations
    - Sampling without replacement: as each item is selected, it is removed from the population
    - Sampling with replacement: objects are not removed from the population as they are selected for the sample; the same object can be picked more than once
- Stratified sampling: split the data into several partitions, then draw random samples from each partition
(An R sketch follows.)
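
A small R sketch of these sampling schemes using base R's sample(); the sizes echo the sample-size illustration below, and the use of the built-in iris data for stratification is purely an example:

> set.seed(1)
> x <- 1:8000
> s_without <- sample(x, 500, replace = FALSE)   # simple random sampling without replacement
> s_with    <- sample(x, 500, replace = TRUE)    # with replacement: an item may repeat
> # stratified sampling: draw 10 rows from each Species partition of iris
> idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
+               function(rows) sample(rows, 10)))
> table(iris[idx, "Species"])   # 10 setosa, 10 versicolor, 10 virginica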

Sample Size
Illustration: the same data set sampled at 8000 points, 2000 points, and 500 points (figures omitted).

Curse of Dimensionality
- Data analysis becomes significantly harder as the dimensionality of the data increases
- Data becomes increasingly sparse in the space that it occupies

Dimensionality Reduction
- Purpose
  - Avoid the curse of dimensionality
  - Reduce the amount of time and memory required by data mining algorithms
  - Allow data to be more easily visualized
  - May help to eliminate irrelevant features or reduce noise
- Techniques (a PCA sketch follows)
  - Principal Component Analysis (PCA)
  - Singular Value Decomposition (SVD)
  - Others: supervised and non-linear techniques
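
A quick PCA sketch in base R (using the built-in iris data purely as an example), keeping the first two principal components:

> pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
> summary(pc)              # proportion of variance explained by each component
> reduced <- pc$x[, 1:2]   # data projected onto the first two principal components
> dim(reduced)
[1] 150   2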

Feature Subset Selection
- Another way to reduce the dimensionality of data
- Redundant features
  - Duplicate much or all of the information contained in one or more other attributes
  - Example: the purchase price of a product and the amount of sales tax paid
- Irrelevant features
  - Contain no information that is useful for the data mining task at hand
  - Example: a student's ID is often irrelevant to the task of predicting students' GPAs

Feature Subset Selection Techniques (S.S.)
- Brute-force approach: try all possible feature subsets as input to the data mining algorithm
- Embedded approaches: feature selection occurs naturally as part of the data mining algorithm
- Filter approaches: features are selected before the data mining algorithm is run (a small sketch follows)
- Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes
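
A small filter-style sketch (illustrative, not from the slides): rank numeric features by absolute correlation with the target and keep the top k, before any mining algorithm is run:

> filter_select <- function(X, y, k = 2) {
+   scores <- sapply(X, function(col) abs(cor(col, y)))   # one score per feature
+   names(sort(scores, decreasing = TRUE))[1:k]           # names of the top-k features
+ }
> filter_select(iris[, 1:4], as.numeric(iris$Species), k = 2)
> # expected to keep the two petal measurements, which correlate most strongly with species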

Feature Creation (S.S.)
- Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
- Three general methodologies
  - Feature extraction: domain-specific
  - Mapping data to a new space: Fourier transform, wavelet transform
  - Feature construction: combining features, e.g., density in cluster analysis