Data Preprocessing in Python
Prof. Sushila Aghav (Sushila.aghav@mitcoe.edu.in)
Content
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation

April 24, 2018
Why Data Preprocessing?
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data (e.g., a blank occupation field)
- noisy: containing errors or outliers (e.g., Salary = -10)
- inconsistent: containing discrepancies in codes or names (e.g., Age = 42 but Birthday = 03/07/1997; a rating recorded as 1, 2, 3 in one source and A, B, C in another; discrepancies between duplicate records)
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
- Quality decisions must be based on quality data: e.g., duplicate or missing data may cause incorrect or even misleading statistics
- Data analysis needs consistent integration of quality data
- Data extraction, cleaning, and transformation make up the majority of the work of building a data analysis pipeline
Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results
- Data discretization: part of data reduction, but of particular importance for numerical data
Forms of Data Preprocessing
Measuring the Central Tendency

Mean (algebraic measure; sample vs. population):
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$  (sample)      $\mu = \frac{\sum x}{N}$  (population)

Weighted arithmetic mean:
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

Trimmed mean: chop extreme values before averaging.

Median: a holistic measure
- the middle value if there is an odd number of values, or the average of the middle two values otherwise
- for grouped data, estimated by interpolation

Mode
- the value that occurs most frequently in the data
- distributions may be unimodal, bimodal, or trimodal
- empirical formula for moderately skewed data: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
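These measures can be computed directly with pandas. A minimal sketch, using an invented sample and hypothetical weights; the trimmed mean here simply drops the single lowest and highest value:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

mean = values.mean()                              # algebraic (sample) mean
weights = pd.Series([1] * 6 + [2] * 6)            # hypothetical weights
wmean = (values * weights).sum() / weights.sum()  # weighted arithmetic mean
# trimmed mean: chop one extreme value from each end before averaging
trimmed = values.sort_values().iloc[1:-1].mean()
median = values.median()                          # average of the middle two here
mode = values.mode().iloc[0]                      # most frequent value
```

With this sample the mean is about 20.33, the median 22.5, and the mode 21 (it occurs twice).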
Data Cleaning
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
Missing Data
Data is not always available: e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
- equipment malfunction
- data that was inconsistent with other recorded data and thus deleted
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- history or changes of the data not being registered
Missing data may need to be inferred.
How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with:
  - a global constant: e.g., "unknown", a new class?!
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, e.g., a Bayesian formula or a decision tree
Missing Data with Pandas (NaN)

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data[0] = None

df.method()                                      description
dropna()                                         Drop observations with missing values
dropna(how='all')                                Drop observations where all cells are NA
dropna(axis=1, how='all')                        Drop a column if all of its values are missing
dropna(thresh=5)                                 Drop rows that contain fewer than 5 non-missing values
fillna(0)                                        Replace missing values with zeros
fillna({'deptno': 10})                           Replace missing values per column (here, 'deptno' with 10)
fillna(method='ffill'), fillna(method='bfill')   Forward-/backward-fill missing values
isnull()                                         Returns True where a value is missing
notnull()                                        Returns True for non-missing values
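A runnable sketch of these calls; the small DataFrame is invented for illustration, and `ffill()` is used as the modern spelling of `fillna(method='ffill')`:

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
mask = string_data.isnull()           # True at the NaN position

df = pd.DataFrame([[1.0,    6.5,    3.0],
                   [1.0,    np.nan, np.nan],
                   [np.nan, np.nan, np.nan]])

cleaned = df.dropna()                 # drops every row containing any NA
all_na_dropped = df.dropna(how='all') # drops only the all-NA row
filled = df.fillna(0)                 # replace NA with zeros
ffilled = df.ffill()                  # forward-fill down each column
```

`dropna()` keeps only the first row here, while `dropna(how='all')` keeps the first two.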
Data Transformation: Removing Duplicates
data.duplicated()
data.drop_duplicates()
data.drop_duplicates(['deptno'])
data.drop_duplicates(['deptno', 'salary'])
Example
data.duplicated()
data.drop_duplicates()
data.drop_duplicates(['k1'])
data.drop_duplicates(['k1', 'k2'], keep='last')
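The calls above can be run end to end on a small invented frame with columns k1 and k2:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

dup_mask = data.duplicated()            # True for rows identical to an earlier row
no_dups = data.drop_duplicates()        # drop fully duplicated rows (keeps the first)
by_k1 = data.drop_duplicates(['k1'])    # keep the first row per distinct k1 value
last = data.drop_duplicates(['k1', 'k2'], keep='last')  # keep the last of each pair
```

Only the final ('two', 4) row duplicates an earlier one, so `duplicated()` flags a single row; `keep='last'` retains that later occurrence instead of the first.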
Data Transformation: Mapping Function
In [55]: lowercased = data['food'].str.lower()
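A typical use of this pattern is normalizing case before mapping each value through a dictionary. A minimal sketch; the food column and the food-to-animal mapping are invented for illustration:

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'Pulled Pork', 'Bacon', 'honey ham'],
                     'ounces': [4, 3, 12, 5]})

# hypothetical mapping from food item to source animal
meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig', 'honey ham': 'pig'}

lowercased = data['food'].str.lower()   # normalize case so the keys match
data['animal'] = lowercased.map(meat_to_animal)
```

Without the `str.lower()` step, 'Pulled Pork' and 'Bacon' would miss the dictionary keys and map to NaN.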
Replacing Values
In [60]: data = pd.Series([1., -999., 2., -999., -1000., 3.])
In [62]: data.replace(-999, np.nan)
Out[62]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
If you want to replace multiple values at once, pass a list followed by the substitute value:
In [63]: data.replace([-999, -1000], np.nan)
Out[63]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
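`replace` also accepts a dict, giving each sentinel its own substitute. A short sketch over the same series:

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

as_nan = data.replace([-999, -1000], np.nan)   # one substitute for the whole list
mixed = data.replace({-999: np.nan, -1000: 0}) # a different substitute per sentinel
```

The dict form replaces the two -999 entries with NaN but maps -1000 to 0.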
Noisy Data
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions
Other data problems that require data cleaning:
- duplicate records
- incomplete data
- inconsistent data
How to Handle Noisy Data?
- Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
- Regression: smooth by fitting the data to regression functions
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and check them by hand (e.g., deal with possible outliers)
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
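The same smoothing can be sketched with pandas: `qcut` reproduces the equal-frequency partition, `groupby(...).transform` applies the per-bin statistic, and each value is snapped to whichever boundary is closer. Note the exact bin means are 9, 22.75, and 29.25; the slide rounds the last two.

```python
import numpy as np
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# equal-frequency (equi-depth) partition into 3 bins of 4 values each
bins = pd.qcut(prices, 3, labels=False)

# smoothing by bin means: replace each value with its bin's mean
by_mean = prices.groupby(bins).transform('mean')

# smoothing by bin boundaries: snap each value to the nearer of bin min / bin max
lo = prices.groupby(bins).transform('min')
hi = prices.groupby(bins).transform('max')
by_boundary = np.where((prices - lo) <= (hi - prices), lo, hi)
```

The boundary result matches the slide exactly: 4, 4, 4, 15 / 21, 21, 25, 25 / 26, 26, 26, 34.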
Binning using Python
In [75]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
In [76]: bins = [18, 25, 35, 60, 100]
In [77]: cats = pd.cut(ages, bins)
In [78]: cats
Out[78]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
Contd..
In [79]: cats.codes
Out[79]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [80]: cats.categories
Out[80]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right', dtype='interval[int64]')
In [81]: pd.value_counts(cats)
Out[81]:
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
Contd..
In [83]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
In [84]: pd.cut(ages, bins, labels=group_names)
Out[84]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data.
Contd..
Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:
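A minimal sketch of this setup; the seed, column choice, and final capping step (clamping outliers to the interval [-3, 3]) are illustrative choices, not part of the slide:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

col = data[2]
outliers_in_col = col[np.abs(col) > 3]                 # values in one column beyond 3
rows_with_any = data[(np.abs(data) > 3).any(axis=1)]   # rows with any |value| > 3

# cap values outside [-3, 3] at the +/-3 boundary
capped = data.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3
```

Boolean-mask assignment with `np.sign` keeps each outlier's sign while clamping its magnitude, so every value in the capped frame lies within [-3, 3].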
Thank You!!