Data Preprocessing in Python
Prof. Sushila Aghav (Sushila.aghav@mitcoe.edu.in)
Content
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation

April 24, 2018
Why Data Preprocessing?
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data (e.g., a blank occupation field)
- noisy: containing errors or outliers (e.g., Salary = -10)
- inconsistent: containing discrepancies in codes or names (e.g., Age = 42 but Birthday = 03/07/1997; a rating recorded as 1, 2, 3 in one source and A, B, C in another; discrepancies between duplicate records)
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
- Quality decisions must be based on quality data: e.g., duplicate or missing data may cause incorrect or even misleading statistics
- Data analysis needs consistent integration of quality data
- Data extraction, cleaning, and transformation make up the majority of the work of building a data analysis pipeline
Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results
- Data discretization: part of data reduction, but of particular importance for numerical data
Forms of Data Preprocessing
Measuring the Central Tendency

Mean (algebraic measure; sample vs. population):
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$  (sample)      $\mu = \frac{\sum x}{N}$  (population)

Weighted arithmetic mean:
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

Trimmed mean: chop extreme values before averaging.

Median: a holistic measure
- the middle value if there is an odd number of values, or the average of the middle two values otherwise
- for grouped data, estimated by interpolation

Mode
- the value that occurs most frequently in the data
- distributions may be unimodal, bimodal, or trimodal
- empirical formula for moderately skewed data: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
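These measures can be computed directly with pandas. A minimal sketch, using an invented sample and hypothetical weights; the trimmed mean here simply drops the single lowest and highest value:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

mean = values.mean()                              # algebraic (sample) mean
weights = pd.Series([1] * 6 + [2] * 6)            # hypothetical weights
wmean = (values * weights).sum() / weights.sum()  # weighted arithmetic mean
# trimmed mean: chop one extreme value from each end before averaging
trimmed = values.sort_values().iloc[1:-1].mean()
median = values.median()                          # average of the middle two here
mode = values.mode().iloc[0]                      # most frequent value
```

With this sample the mean is about 20.33, the median 22.5, and the mode 21 (it occurs twice).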
Data Cleaning
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
Missing Data
Data is not always available: e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
- equipment malfunction
- data that was inconsistent with other recorded data and thus deleted
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- history or changes of the data not being registered
Missing data may need to be inferred.
How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with:
  - a global constant: e.g., "unknown", a new class?!
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, e.g., a Bayesian formula or a decision tree
Missing Data with Pandas (NaN)

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data[0] = None

df.method()                                      description
dropna()                                         Drop observations with missing values
dropna(how='all')                                Drop observations where all cells are NA
dropna(axis=1, how='all')                        Drop a column if all of its values are missing
dropna(thresh=5)                                 Drop rows that contain fewer than 5 non-missing values
fillna(0)                                        Replace missing values with zeros
fillna({'deptno': 10})                           Replace missing values per column (here, 'deptno' with 10)
fillna(method='ffill'), fillna(method='bfill')   Forward-/backward-fill missing values
isnull()                                         Returns True where a value is missing
notnull()                                        Returns True for non-missing values
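A runnable sketch of these calls; the small DataFrame is invented for illustration, and `ffill()` is used as the modern spelling of `fillna(method='ffill')`:

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
mask = string_data.isnull()           # True at the NaN position

df = pd.DataFrame([[1.0,    6.5,    3.0],
                   [1.0,    np.nan, np.nan],
                   [np.nan, np.nan, np.nan]])

cleaned = df.dropna()                 # drops every row containing any NA
all_na_dropped = df.dropna(how='all') # drops only the all-NA row
filled = df.fillna(0)                 # replace NA with zeros
ffilled = df.ffill()                  # forward-fill down each column
```

`dropna()` keeps only the first row here, while `dropna(how='all')` keeps the first two.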
Data Transformation: Removing Duplicates
data.duplicated()
data.drop_duplicates()
data.drop_duplicates(['deptno'])
data.drop_duplicates(['deptno', 'salary'])
Example
data.duplicated()
data.drop_duplicates()
data.drop_duplicates(['k1'])
data.drop_duplicates(['k1', 'k2'], keep='last')
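The calls above can be run end to end on a small invented frame with columns k1 and k2:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

dup_mask = data.duplicated()            # True for rows identical to an earlier row
no_dups = data.drop_duplicates()        # drop fully duplicated rows (keeps the first)
by_k1 = data.drop_duplicates(['k1'])    # keep the first row per distinct k1 value
last = data.drop_duplicates(['k1', 'k2'], keep='last')  # keep the last of each pair
```

Only the final ('two', 4) row duplicates an earlier one, so `duplicated()` flags a single row; `keep='last'` retains that later occurrence instead of the first.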
Data Transformation: Mapping Function
In [55]: lowercased = data['food'].str.lower()
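A typical use of this pattern is normalizing case before mapping each value through a dictionary. A minimal sketch; the food column and the food-to-animal mapping are invented for illustration:

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'Pulled Pork', 'Bacon', 'honey ham'],
                     'ounces': [4, 3, 12, 5]})

# hypothetical mapping from food item to source animal
meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig', 'honey ham': 'pig'}

lowercased = data['food'].str.lower()   # normalize case so the keys match
data['animal'] = lowercased.map(meat_to_animal)
```

Without the `str.lower()` step, 'Pulled Pork' and 'Bacon' would miss the dictionary keys and map to NaN.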
Replacing Values
In [60]: data = pd.Series([1., -999., 2., -999., -1000., 3.])
In [62]: data.replace(-999, np.nan)
Out[62]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
If you want to replace multiple values at once, pass a list followed by the substitute value:
In [63]: data.replace([-999, -1000], np.nan)
Out[63]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
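`replace` also accepts a dict, giving each sentinel its own substitute. A short sketch over the same series:

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

as_nan = data.replace([-999, -1000], np.nan)   # one substitute for the whole list
mixed = data.replace({-999: np.nan, -1000: 0}) # a different substitute per sentinel
```

The dict form replaces the two -999 entries with NaN but maps -1000 to 0.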
Noisy Data
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions
Other data problems that require data cleaning:
- duplicate records
- incomplete data
- inconsistent data
How to Handle Noisy Data?
- Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
- Regression: smooth by fitting the data to regression functions
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and check them by hand (e.g., deal with possible outliers)
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
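The same smoothing can be sketched with pandas: `qcut` reproduces the equal-frequency partition, `groupby(...).transform` applies the per-bin statistic, and each value is snapped to whichever boundary is closer. Note the exact bin means are 9, 22.75, and 29.25; the slide rounds the last two.

```python
import numpy as np
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# equal-frequency (equi-depth) partition into 3 bins of 4 values each
bins = pd.qcut(prices, 3, labels=False)

# smoothing by bin means: replace each value with its bin's mean
by_mean = prices.groupby(bins).transform('mean')

# smoothing by bin boundaries: snap each value to the nearer of bin min / bin max
lo = prices.groupby(bins).transform('min')
hi = prices.groupby(bins).transform('max')
by_boundary = np.where((prices - lo) <= (hi - prices), lo, hi)
```

The boundary result matches the slide exactly: 4, 4, 4, 15 / 21, 21, 25, 25 / 26, 26, 26, 34.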
Binning using Python
In [75]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
In [76]: bins = [18, 25, 35, 60, 100]
In [77]: cats = pd.cut(ages, bins)
In [78]: cats
Out[78]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
Contd..
In [79]: cats.codes
Out[79]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [80]: cats.categories
Out[80]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right', dtype='interval[int64]')
In [81]: pd.value_counts(cats)
Out[81]:
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
Contd..
In [83]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
In [84]: pd.cut(ages, bins, labels=group_names)
Out[84]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data.
Contd..
Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:
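A minimal sketch of this setup; the seed, column choice, and final capping step (clamping outliers to the interval [-3, 3]) are illustrative choices, not part of the slide:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

col = data[2]
outliers_in_col = col[np.abs(col) > 3]                 # values in one column beyond 3
rows_with_any = data[(np.abs(data) > 3).any(axis=1)]   # rows with any |value| > 3

# cap values outside [-3, 3] at the +/-3 boundary
capped = data.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3
```

Boolean-mask assignment with `np.sign` keeps each outlier's sign while clamping its magnitude, so every value in the capped frame lies within [-3, 3].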
Thank You!!