Data Exploration & Visualization


1 Introduction to Data Mining Data Exploration & Visualization CPSC/AMTH 445a/545a Guy Wolf Yale University Fall 2016 CPSC 445 (Guy Wolf) Data Exploration Yale - Fall / 46

Outline

1. Tabular data
   - Observations/Data-Points vs. Features/Attributes
   - Qualitative vs. Quantitative attributes
   - Qualitative: Nominal vs. Ordinal
   - Quantitative: Interval vs. Ratio
2. Summary statistics
   - Frequency, mode, & percentiles
   - Mean & median
   - Range & variance
   - Covariance & correlation
   - Data quality
3. Visualizations
   - Box plots
   - Histograms
   - Star plots
   - Parallel coordinate plots
   - Scatter plots
   - Quiver plots
4. Transactional data
   - Term matrix
   - Text documents
5. Structured signals (e.g., audio and EEG)
   - Fourier & wavelets
   - Spectrogram & scalogram
6. Multidimensional signals (e.g., images and videos)
   - Visualization with contour plots
7. Nonparametric (affinity-/distance-based) representations
   - Graph data
   - Visualization with matrix plots

What is data?

Experimental vs. observational data

Experimental data: Data collected from strictly controlled/designed experiments, with efforts made to ensure statistical validity.
Examples:
- Medical clinical trials
- Election polls

Observational data: Data collected from real-world settings without control over the captured underlying phenomena. It is easier to collect and obtain, but results and conclusions from such data may be biased or inconclusive.

Almost all data used in data mining is observational data.

Tabular Data

Organizing data in a table of observations-by-features is considered the most convenient and standard format for data analysis.

Example: consider the following procedure:
1. From each machine, collect 3 temperature measurements (MOBO, CPU, GPU), 4 software parameters (CPU, RAM, HDD, # processes), and 2 power consumption values (MOBO, GPU).
2. Attach unique identifiers of the machine, OS, and hardware manufacturer.
3. Every second, store a record with these values from every machine in the system.

We end up with hundreds of thousands of records, each containing 12 fields (3 + 4 + 2 measurements plus 3 identifiers).

Tabular data: Observations/Data-Points vs. Features/Attributes

Columns are features (also called attributes, properties, or fields). Rows are observations (also called objects, data points, samples, or records).

Timestamp     | OS  | Temp  | CPU | # proc
/1/16 1:00 AM | LNX | 45 °C | 65% |

Tabular data: Types of features/attributes

It is important to recognize the types of values each feature/attribute takes in order to understand which operations make sense for it.

Examples:
- Can we compute an average eye color?
- How do we compute the difference between phone numbers?
- Can we say today is twice as hot/cold as yesterday?

This is similar to problems like 6 apples / 4 people = 1.5 apples per person, but 10 people / 4 car seats = 3 cars (2.5 rounded up, since a fraction of a car is not meaningful).

Tabular data: Qualitative vs. Quantitative attributes

Attribute values can be split into two types:

Qualitative attributes: Attributes that take values from a (finite) set of categories are called categorical or qualitative attributes. In some sense, they describe an object/observation, rather than measure its properties.

Quantitative attributes: Attributes that represent quantities are called numerical or quantitative attributes. They provide concrete quantifiable measurements of an object/observation.

Tabular data: Qualitative: Nominal vs. Ordinal

Qualitative attributes can be split further into two types:

Nominal attributes (examples: zip codes, eye color, operating system, gender): Values of such attributes just specify names, without any particular order or relation between them (except for = and ≠). Binary attributes are nominal attributes with only two values (Yes/No or 0/1). They can be symmetric or asymmetric, based on whether or not their values are equally informative.

Ordinal attributes (examples: ratings, grades, street/avenue numbers): Values of such attributes have some order, even though they don't specify an exact quantity.

Tabular data: Quantitative: Interval vs. Ratio

Quantitative attributes can also be split into two types:

Interval attributes (examples: calendar dates, azimuth direction, Fahrenheit temperatures): Such attributes represent quantities with meaningful differences (or fixed intervals) between their values, but no multiplicative relations.

Ratio attributes (examples: mass, length, distance, currency, age, electrical current): Such attributes represent quantities that have meaningful ratios between their values. Unlike interval attributes, ratio ones usually have an absolute zero.

We can also split quantities into discrete and continuous ones. All qualitative attributes are considered discrete.

Tabular data: Summary of attribute types

The types of attributes can be regarded via the operations that can be applied to them:
- Comparison (= and ≠): every type
- Ordering (> and <): every type except nominal
- Differences (−) and addition (+): only quantitative
- Division (/) and multiplication (×, ·): only ratio

Other operations (e.g., mean, median, correlation) may also be inapplicable for some types while applicable to others.
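The hierarchy of operations above can be written down as a small lookup table. The following is only a sketch: the dictionary layout, the operation labels, and the helper name `is_meaningful` are illustrative assumptions, not part of the lecture.

```python
# Operations that are meaningful for each attribute type, following the
# summary of attribute types. Names and layout are illustrative assumptions.
ALLOWED_OPERATIONS = {
    "nominal": {"comparison"},
    "ordinal": {"comparison", "ordering"},
    "interval": {"comparison", "ordering", "difference"},
    "ratio": {"comparison", "ordering", "difference", "ratio"},
}


def is_meaningful(attribute_type, operation):
    """Return True if the operation makes sense for the attribute type."""
    return operation in ALLOWED_OPERATIONS[attribute_type]


# Averaging eye colors would require differences/addition on a nominal attribute.
eye_color_average = is_meaningful("nominal", "difference")
# "Twice as hot" would require ratios on an interval attribute (e.g., Fahrenheit).
twice_as_hot = is_meaningful("interval", "ratio")
# "Twice as heavy" is a ratio on a ratio attribute (mass).
twice_as_heavy = is_meaningful("ratio", "ratio")
```

Here the first two checks come out false and the third comes out true, matching the eye color and temperature questions raised earlier.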

Tabular data: Technical formats

Tabular data can be stored, collected, or given in several standard formats, such as:
- Comma-separated values (CSV) file
- Flat file or delimited text file (e.g., space or tab delimited)
- XML or other log files
- Proprietary formats (e.g., FCS for biological data or MAT files for Matlab data)
- Database tables

There are several techniques and standard designs to collect and store big data in databases. Data warehouse, ETL (extract-transform-load), and OLAP (Online Analytical Processing) are some related terms encountered frequently in the IT industry.

Tabular data: Data warehouse: star and snowflake schemas

[Figures: star schema; snowflake schema]

Summary statistics

The raw representation of the data is often not convenient for initial exploration and understanding of the data. How do we get general insights into the data and its attributes as a whole?

Summary statistics: Properties that summarize global information, such as central tendency, spread, and variations of observations and features.

These statistics provide an important first step in data analysis, and most of them are not difficult to compute in linear time with respect to the size of the data.


Summary statistics: Frequency, mode, & percentiles

Frequency: The portion (e.g., percentage) of the observations with each specific value of a categorical or discrete attribute.

Mode: The most frequent value of an attribute in the data.

Percentiles: The p-th percentile (with $0 \le p \le 100$) of an attribute is a value $P_p$ such that p% of the observed values of this attribute are less than $P_p$. We typically take $P_p$ as one of the observed values of the attribute.

Alternatives: quartiles $Q_i$ ($i = 1, 2, 3$), quantiles, etc.

Visual examples: stem-and-leaf displays; quantile & Q-Q plots.
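These three statistics can be sketched in a few lines of Python. The percentile below uses the nearest-rank convention, which is one of several common conventions and is an assumption here. The sample columns are hypothetical.

```python
import math
from collections import Counter


def frequencies(values):
    """Fraction of observations taking each distinct value."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in counts.items()}


def mode(values):
    """Most frequent value (ties resolved by first occurrence)."""
    return Counter(values).most_common(1)[0][0]


def percentile(values, p):
    """Nearest-rank percentile: always returns one of the observed values."""
    if not 0 <= p <= 100:
        raise ValueError("p must lie in [0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


os_column = ["LNX", "WIN", "LNX", "MAC", "LNX"]  # hypothetical nominal attribute
temps = [45, 47, 52, 61, 44, 49, 58, 50]         # hypothetical temperatures

os_frequencies = frequencies(os_column)
os_mode = mode(os_column)
temp_median = percentile(temps, 50)
temp_q1 = percentile(temps, 25)
```

Frequency and mode apply to any attribute type, while percentiles need at least an ordering of the values.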



Summary statistics: Mean & median

Mean: The mean (or average) $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the most common way to measure the central location or value of data points. However, it is very sensitive to outliers. A trimmed mean is more robust to outliers by disregarding extreme values. A weighted mean also takes into account weights for each observation.

Median: The median of an attribute is a value such that half of the observed values are above it and half are below it. It is the middle value for an odd number of observations, or the average (when it makes sense) between the two middle numbers for an even number of observations. The median corresponds to $P_{50}$ and $Q_2$.
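A small sketch of these centrality measures follows. The trimming rule (drop the k smallest and k largest values) is one simple choice among several and is an assumption, as are the function names and the sample data.

```python
def mean(xs):
    """Arithmetic mean: (1/n) * sum of x_i."""
    return sum(xs) / len(xs)


def trimmed_mean(xs, k):
    """Mean after discarding the k smallest and k largest values."""
    ordered = sorted(xs)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this sample")
    return mean(ordered[k:len(ordered) - k])


def weighted_mean(xs, ws):
    """Mean in which observation x_i contributes with weight w_i."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


def median(xs):
    """Middle value, or the average of the two middle values."""
    ordered = sorted(xs)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [2, 3, 3, 4, 5, 100]  # hypothetical sample with a single outlier

plain = mean(data)              # pulled far upward by the outlier
robust = trimmed_mean(data, 1)  # ignores the smallest and largest value
middle = median(data)
```

On this sample the single outlier moves the mean well away from the bulk of the values, while the trimmed mean and the median stay close to it.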

Summary statistics: Centrality and skewed data

Relations between three measures of centrality (mean, median, and mode) can indicate symmetric or skewed distributions of attributes.

[Figure panels: symmetric; positively skewed; negatively skewed]


Summary statistics: Range & variance

Range: The range is the difference between the max and min observed values of an attribute.

Variance: The variance $s_x^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$ and the standard deviation (STD) $s_x = \sqrt{s_x^2}$ are the most common ways to measure the spread of values. However, like the mean, they are sensitive to outliers.

Other spread measures include:
- average absolute deviation: the average of $|x_i - \bar{x}|$
- median absolute deviation: the median of $|x_i - \bar{x}|$
- interquartile range: the difference $x_{75\%} - x_{25\%}$
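The spread measures above can be sketched as follows. The variance uses the $\frac{1}{n}$ normalization from the formula, deviations are taken about the mean, and the quartiles use the nearest-rank convention, which is an assumption. The sample data are hypothetical.

```python
import math
import statistics


def value_range(xs):
    return max(xs) - min(xs)


def variance(xs):
    """Population-style variance with 1/n normalization."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def std(xs):
    return math.sqrt(variance(xs))


def absolute_deviations(xs):
    """Absolute deviations |x_i - mean| used by both AAD and MAD."""
    m = sum(xs) / len(xs)
    return [abs(x - m) for x in xs]


def average_absolute_deviation(xs):
    devs = absolute_deviations(xs)
    return sum(devs) / len(devs)


def median_absolute_deviation(xs):
    return statistics.median(absolute_deviations(xs))


def interquartile_range(xs):
    """x_75% - x_25% with nearest-rank percentiles (an assumed convention)."""
    ordered = sorted(xs)

    def nearest_rank(p):
        return ordered[max(1, math.ceil(p * len(ordered) / 100)) - 1]

    return nearest_rank(75) - nearest_rank(25)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample with mean 5
```

All of these need differences between values, so they apply only to quantitative attributes.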


Summary statistics: Covariance & correlation

Covariance: Measures the degree to which attributes vary together and is computed by
$\mathrm{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$.
This value depends on the magnitude/spread of the attribute values. For $k$ attributes, these form a $k \times k$ covariance matrix, with variances $s_x^2 = \mathrm{cov}(x, x)$ on its diagonal.

Correlation: A value between -1 and 1 that indicates how strongly two attributes are (linearly) related. Pearson correlation: $\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{s_x s_y}$. Notice that it is independent of magnitudes/spreads and $\mathrm{corr}(x, x) = 1$.

Notice that Pearson correlation is the covariance or dot-product between standardized attributes:
$\mathrm{corr}(x, y) = s_x^{-1} s_y^{-1} \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$
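The identity between Pearson correlation and the covariance of standardized attributes can be checked numerically. This is a minimal sketch with $\frac{1}{n}$ normalization throughout. The function names and sample vectors are illustrative assumptions.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Covariance with 1/n normalization."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def std(xs):
    return math.sqrt(cov(xs, xs))


def corr(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return cov(xs, ys) / (std(xs) * std(ys))


def corr_standardized(xs, ys):
    """Same quantity, computed as the covariance of standardized attributes."""
    mx, my, sx, sy = mean(xs), mean(ys), std(xs), std(ys)
    zx = [(x - mx) / sx for x in xs]
    zy = [(y - my) / sy for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # increases linearly with x
z = [5, 4, 3, 2, 1]    # decreases linearly with x
y_scaled = [10 * v for v in y]
```

Rescaling `y` multiplies the covariance by the same factor but leaves the correlation unchanged, which is the independence of magnitude noted above.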

Summary statistics: Misleading example: pirates & global warming

[Figure taken from Wikipedia]

Summary statistics: Data quality

Summary statistics enable identification of various data quality issues, such as:

Precision: The closeness of repeated measurements to one another.

Bias: A systematic variation of measurements from the quantity being measured.

Accuracy: The closeness of measurements to the true value of the quantity being measured.

Other issues include missing values, outliers, and duplicate values.

Visualizations: Why do we need visualizations?

While summary statistics provide useful information about the data, they can be overwhelming and hard to track when many attributes are considered.

Visualization: Conversion of data into visual elements that express characteristics, relationships, and information about data points and attributes.

Visualizations provide graphic representations that enable us to draw insights at a single glance.

Example (TreeMap): [Figure taken from Wikipedia]

Visualizations: What constitutes a good visualization?

There is no really good answer, but there are some general guidelines, the ACCENT principles:
- Apprehension: we can correctly perceive relations among variables.
- Clarity: visually distinguish important relations and elements.
- Consistency: comparing graphical elements/displays shows faithful (dis)similarities in the data.
- Efficiency: simplify complex relations and patterns in the visualization.
- Necessity: only include necessary graphical elements, with no extraneous elements.
- Truthfulness: true values (absolute or relative) can be determined from graphical elements.

Visualizations: Box plots

Box plots (invented by J. Tukey) show the five-number distribution of attribute values based on percentiles. The figure labels the following elements:
- Outliers
- 90th percentile
- 75th percentile
- Median
- 25th percentile
- 10th percentile
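The quantities a box plot of this kind draws can be computed directly. The sketch below assumes whiskers at the 10th and 90th percentiles (as in the labels above), nearest-rank percentiles, and outliers defined as values beyond the whiskers. Other box plot conventions exist, and the function names are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def box_plot_summary(values):
    """Five percentile markers plus the points drawn individually as outliers."""
    summary = {
        "p10": percentile(values, 10),
        "p25": percentile(values, 25),
        "median": percentile(values, 50),
        "p75": percentile(values, 75),
        "p90": percentile(values, 90),
    }
    summary["outliers"] = [
        v for v in values if v < summary["p10"] or v > summary["p90"]
    ]
    return summary


values = list(range(1, 21))  # hypothetical attribute values 1..20
summary = box_plot_summary(values)
```

The box spans `p25` to `p75`, the whiskers extend to `p10` and `p90`, and the remaining points are plotted one by one.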

Visualizations: Histograms

[Figures. One example is taken from: Pierchala, C., "The choice of age groupings may affect the quality of tabular presentations"]

Visualizations: Star plots

[Figures]

Visualizations: Parallel coordinate plots

[Figure] Notice that attributes in this case do not have a particular order.

Visualizations: Scatter plots

[Figure]

Visualizations: Quiver plots

[Figure]

Non-tabular Data

Transactional data

In transactional data, each observation is a transaction that contains a collection of items or a sequence of events.

Example (market basket data):
- Customer #1: {milk, bread, butter}
- Customer #2: {orange juice, milk}
- Customer #3: {orange juice, peanut butter, jelly, bread}
- ...

Transaction items can also contain numerical attributes, such as the number of purchased items (e.g., 3 boxes of cookies) or their price. When sequences (e.g., events, actions, or genes) are considered, temporal/order information may also be included.

Transactional data: Term matrix

In some cases, transactional data can be converted to tabular form by considering a term matrix (a.k.a. bag of words/features techniques).

Example (entries follow the market basket data of Customers #1 to #3: 1 if the customer bought the item, 0 otherwise):

CustomerID  | milk | bread | butter | O.J. | cheese | P.B. | jelly
Customer #1 | 1    | 1     | 1      | 0    | 0      | 0    | 0
Customer #2 | 1    | 0     | 0      | 1    | 0      | 0    | 0
Customer #3 | 0    | 1     | 0      | 1    | 0      | 1    | 1

This representation loses sequential information, and applying it to continuous values requires a discretization step.
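Building a term matrix from transactions can be sketched as below, using the three market baskets from the transactional data example. The function name `term_matrix` and the choice to order columns by first appearance are assumptions made for illustration.

```python
def term_matrix(transactions):
    """Convert {transaction_id: [items]} into (terms, {transaction_id: counts})."""
    terms = []
    for items in transactions.values():
        for item in items:
            if item not in terms:  # keep first-appearance order of columns
                terms.append(item)
    rows = {
        tid: [items.count(term) for term in terms]
        for tid, items in transactions.items()
    }
    return terms, rows


baskets = {
    "Customer#1": ["milk", "bread", "butter"],
    "Customer#2": ["orange juice", "milk"],
    "Customer#3": ["orange juice", "peanut butter", "jelly", "bread"],
}

terms, rows = term_matrix(baskets)
```

Each row is now a fixed-length vector, so the tabular tools from the earlier sections apply, at the cost of discarding the order of items within a transaction.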


Transactional data: Text documents

Text documents can be considered as transactional data in one of two ways:

1. Each document can be considered as a big transaction containing words. Bag of words techniques ignore grammatical structures and represent a document as a histogram of word occurrences. Similar approaches can also be applied to images, questionnaires, etc., with an appropriate dictionary-building clustering step.

2. A document can be considered as a transactional dataset on its own, which contains word contexts (e.g., with n-grams or skip-grams). Word2vec techniques use this approach to associate numerical coordinates (typically in $\mathbb{R}^{300}$) to words based on the contexts in which they appear.
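Both views of a document can be sketched with the standard library. Tokenization here is plain lowercasing and whitespace splitting, which is a simplifying assumption, and the sample sentence is hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """View 1: the document as a histogram of word occurrences."""
    return Counter(text.lower().split())


def ngrams(tokens, n):
    """View 2: the document as a collection of n-word contexts."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


document = "the cat sat on the mat"  # hypothetical document

histogram = bag_of_words(document)
bigrams = ngrams(document.split(), 2)
```

The histogram forgets word order entirely, while the bigrams keep local context, which is the kind of information context-based embeddings are trained on.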

Transactional data: Text documents

Example (term analysis of Donald Trump's tweets): [Figures: most frequent words; iPhone vs. Android]
Taken from varianceexplained.org/r/trump-tweets/

[Figure taken from github.com/aubry74/visual-word2vec/]

Structured signals

Structured signals have well-known relations between their attributes. They are typically numerical, with temporal or spatial ordering.

Examples:
- Audio recordings
- EEG signals
- Heart rate
- Room temperatures

Each data point is then a signal collected over time (or space), and can be analyzed with signal processing tools.

Structured signals: Fourier & power spectrum

[Figure panels: time series; power spectrum]
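As a rough illustration of how a power spectrum is obtained from a time series, the sketch below evaluates the discrete Fourier transform directly from its definition. The $\frac{1}{n}$ power normalization and the test signal are assumptions. Practical code would use an FFT routine instead of this quadratic-time loop.

```python
import cmath
import math


def power_spectrum(signal):
    """Squared magnitude of each DFT coefficient, divided by the signal length."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coefficient = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coefficient) ** 2 / n)
    return spectrum


n = 32
# Hypothetical time series: a pure sine completing 4 cycles over the window.
signal = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]

power = power_spectrum(signal)
# Only the first half of the spectrum is distinct for real-valued signals.
peak = max(range(n // 2), key=lambda k: power[k])
```

The power concentrates at the frequency index matching the number of cycles, which is what the power spectrum panel makes visible for a real recording.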

Structured signals: STFT & wavelets

[Figure panels: STFT; wavelets; Haar wavelets (lowpass, scale 1, scale 2)]
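A two-level Haar decomposition, giving a lowpass part plus detail coefficients at two scales, can be sketched as follows. The orthonormal scaling by $\sqrt{2}$ and the sample signal are assumptions chosen for illustration.

```python
import math


def haar_step(signal):
    """One Haar level: pairwise scaled sums (lowpass) and differences (detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    lowpass = [(a + b) / root2 for a, b in pairs]
    detail = [(a - b) / root2 for a, b in pairs]
    return lowpass, detail


def haar_decompose(signal, levels):
    """Repeatedly split the lowpass part; returns (lowpass, details per scale)."""
    lowpass = list(signal)
    details = []
    for _ in range(levels):
        lowpass, detail = haar_step(lowpass)
        details.append(detail)
    return lowpass, details


signal = [4.0, 4.0, 2.0, 0.0, 1.0, 1.0, 5.0, 7.0]  # hypothetical signal
lowpass, details = haar_decompose(signal, 2)

# With orthonormal scaling the total energy is preserved across the split.
energy_in = sum(x * x for x in signal)
energy_out = sum(x * x for x in lowpass) + sum(
    d * d for scale in details for d in scale
)
```

Here `details[0]` holds the finest-scale coefficients and `details[1]` the next coarser scale. A detail coefficient is zero wherever the two neighboring values are equal.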

Structured signals: Spectrogram & scalogram

[Figure panels: spectrogram; scalogram]

Multidimensional signals

Multidimensional signals similarly have several coordinates that specify the relations between their attributes.

Examples:
- Grayscale images have two spatial coordinates that determine pixel positions.
- Videos have two spatial coordinates and one temporal coordinate that determine pixel positions.
- Geographic data has two or three coordinates determining longitude, latitude, and elevation.
- Colored and hyperspectral images have two spatial coordinates and one spectral/channel coordinate.

In general, many signal processing approaches can be extended from one-dimensional signals to multidimensional ones.

Multidimensional signals: Two-dimensional wavelets

[Figures]

Multidimensional signals: Visualization with contour plots

[Figures]

Nonparametric representations

In some cases, the important information in the data is the relations between data points, rather than their attributes.

Examples:
- Spatial locations and trajectories
- Phone calls and correspondences
- Gene interactions and cell progressions

In these cases an affinity matrix between data points, based on similarities or distances, can be used for analysis. Essentially, each data point is represented by its relations to other data points rather than by its own attributes.
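One common way to turn distances into affinities is a Gaussian kernel. The sketch below uses it purely as an illustrative assumption, since the text does not fix a particular similarity. The bandwidth `sigma` and the sample points are hypothetical.

```python
import math


def distance_matrix(points):
    """Pairwise Euclidean distances between data points."""
    return [[math.dist(p, q) for q in points] for p in points]


def gaussian_affinity(points, sigma):
    """Affinity exp(-d^2 / (2 sigma^2)); equals 1 at distance 0 and decays with d."""
    distances = distance_matrix(points)
    return [
        [math.exp(-(d ** 2) / (2 * sigma ** 2)) for d in row]
        for row in distances
    ]


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]  # hypothetical spatial locations

distances = distance_matrix(points)
affinity = gaussian_affinity(points, sigma=1.0)
```

Row `i` of the affinity matrix then serves as the representation of data point `i`: it describes the point only through its relations to the others.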


Nonparametric representations: Graph data

Graphs can be used to formalize relations in data in two ways:

1. Relationships between attributes can form graphs (e.g., molecule data, such as benzene, C6H6). In this case each data point is a graph on its own, and this is a more complicated example of structured data.

2. The graph is considered as the dataset, and each node is a data point (e.g., social networks and web-reference data). In this case, an adjacency matrix can form an affinity matrix. Conversely, affinity matrices can form adjacency matrices, so nonparametric data is often considered as graph data.

Spectral graph methods (e.g., SVD of the graph Laplacian) can be used to associate coordinates to data points in the second case, for visualization with scatter plots and further analysis.
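The passage from an affinity matrix to an adjacency matrix, and from there to a graph Laplacian, can be sketched as follows. Thresholding is only one of several ways to derive adjacency and is an assumption here, as are the unnormalized Laplacian $L = D - A$, the threshold value, and the sample affinities. The spectral decomposition itself is left to a numerical library.

```python
def adjacency_from_affinity(affinity, threshold):
    """Connect two distinct nodes when their affinity reaches the threshold."""
    n = len(affinity)
    return [
        [1 if i != j and affinity[i][j] >= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A, where D holds node degrees."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    return [
        [(degrees[i] if i == j else 0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical affinities: node 1 is close to both node 0 and node 2.
affinity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.8],
    [0.1, 0.8, 1.0],
]

adjacency = adjacency_from_affinity(affinity, threshold=0.5)
laplacian = graph_laplacian(adjacency)
```

Every row of the Laplacian sums to zero, so the constant vector is always in its null space. Spectral methods therefore look at the remaining low-order spectral vectors to assign coordinates to the nodes.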

Nonparametric representations: Visualization with matrix plots

[Figure]

Summary

We considered the following data and attribute types, and briefly showed how to handle, process, and visualize them:

Types of attributes:
- Nominal
- Ordinal
- Interval
- Ratio

Types of data:
- Tabular data
- Transactional & text data
- Structured (1D, 2D, & more) data
- Nonparametric & graph data

Exploratory data analysis is crucial for obtaining intelligible results, e.g., by identifying valid applicable operations on the data and possibly transforming it to a more amenable representation for analysis.

Other preprocessing steps include normalization/standardization, sampling, discretization, aggregation, and dimensionality reduction.
