Data Preprocessing in Python. Prof.Sushila Aghav

Data Preprocessing in Python
Prof. Sushila Aghav
Sushila.aghav@mitcoe.edu.in

Content
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
April 24, 2018

Why Data Preprocessing?
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., occupation=""
- noisy: containing errors or outliers
  - e.g., Salary=-10
- inconsistent: containing discrepancies in codes or names
  - e.g., Age=42, Birthday=03/07/1997
  - e.g., was rating 1, 2, 3, now rating A, B, C
  - e.g., discrepancy between duplicate records

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics.
  - Data analysis needs consistent integration of quality data.
- Data extraction, cleaning, and transformation make up the majority of the work in a data analysis project.

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results
- Data discretization: part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing

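Measuring the Central Tendency
- Mean (algebraic measure), sample vs. population: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\mu = \frac{\sum x}{N}$
  - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  - Trimmed mean: chopping extreme values
- Median (a holistic measure)
  - Middle value if there is an odd number of values, or the average of the middle two values otherwise
  - Estimated by interpolation (for grouped data): $\text{median} \approx L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$, where $L_1$ is the lower boundary of the median interval, $(\sum f)_l$ is the sum of the frequencies of all intervals below it, $f_{\text{median}}$ is the frequency of the median interval, and $c$ is its width
- Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula (moderately skewed, unimodal data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$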
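The measures above can be tried out with pandas and NumPy. The following is a minimal sketch; the sample values and weights are hypothetical and not taken from the slides, and the trimmed mean here simply drops the single smallest and largest value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one extreme value (40) and hypothetical weights.
x = pd.Series([2, 3, 3, 4, 5, 5, 5, 40])
w = np.array([1, 1, 1, 1, 1, 1, 1, 0.1])

mean = x.mean()                                  # arithmetic mean
weighted_mean = np.average(x, weights=w)         # sum(w_i * x_i) / sum(w_i)
trimmed_mean = x.sort_values().iloc[1:-1].mean() # drop one value at each end
median = x.median()                              # average of the two middle values
modes = x.mode().tolist()                        # most frequent value(s)

print(mean, weighted_mean, trimmed_mean, median, modes)
```

The extreme value pulls the plain mean well above the median, while the trimmed and weighted means stay close to the bulk of the data.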

Data Cleaning
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration

Missing Data
- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
- Missing data may need to be inferred.

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with
  - a global constant: e.g., "unknown", a new class?!
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as Bayesian formula or decision tree
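Three of the automatic strategies listed above can be sketched in pandas as follows. The DataFrame, its column names (`cls`, `income`), and the constant `-1` are hypothetical choices for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "cls" is the class label, "income" has missing values.
df = pd.DataFrame({
    "cls": ["a", "a", "b", "b", "b"],
    "income": [10.0, np.nan, 30.0, np.nan, 50.0],
})

# 1. Global constant (here -1 stands in for "unknown").
by_constant = df["income"].fillna(-1)

# 2. Attribute mean over all samples (NaN is skipped when computing the mean).
by_mean = df["income"].fillna(df["income"].mean())

# 3. Attribute mean of the samples in the same class.
class_mean = df.groupby("cls")["income"].transform("mean")
by_class_mean = df["income"].fillna(class_mean)

print(by_class_mean.tolist())
```

The class-wise fill gives the second "a" row the mean of class "a" and the missing "b" row the mean of class "b", rather than one value for everyone.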

Missing Data with Pandas
pandas marks missing values as NaN; None is also treated as missing.

    string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
    string_data[0] = None

df.method()                  description
dropna()                     Drop missing observations
dropna(how='all')            Drop observations where all cells are NA
dropna(axis=1, how='all')    Drop a column if all its values are missing
dropna(thresh=5)             Drop rows that contain fewer than 5 non-missing values
fillna(0)                    Replace missing values with zeros
fillna({'deptno': 10})       Replace missing values in column deptno with 10
fillna(method='ffill')       Fill each missing value with the previous valid value
fillna(method='bfill')       Fill each missing value with the next valid value
isnull()                     Returns True if the value is missing
notnull()                    Returns True for non-missing values
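A small runnable sketch of the methods above follows. The DataFrame and its columns (`deptno`, `salary`, `bonus`) are hypothetical. It uses `ffill()` for forward filling, which as far as I know is the form recent pandas releases prefer over `fillna(method='ffill')`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 and row 3 are entirely missing, "bonus" is an empty column.
df = pd.DataFrame({
    "deptno": [10, np.nan, 30, np.nan],
    "salary": [1000.0, np.nan, np.nan, np.nan],
    "bonus": [np.nan, np.nan, np.nan, np.nan],
})

complete_rows = df.dropna()                    # rows with no missing cell at all
not_all_na = df.dropna(how="all")              # drop rows where every cell is NA
no_empty_cols = df.dropna(axis=1, how="all")   # drop columns where every cell is NA
at_least_two = df.dropna(thresh=2)             # keep rows with >= 2 non-missing values

filled = df.fillna({"deptno": 10})             # per-column fill value
forward = df["deptno"].ffill()                 # propagate the last valid value
is_missing = df["salary"].isnull()             # boolean mask of missing cells

print(not_all_na)
```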

Data Transformation: Removing Duplicates

    Data.duplicated()
    Data.drop_duplicates()
    Data.drop_duplicates(['deptno'])
    Data.drop_duplicates(['deptno', 'salary'])

Example

    data.duplicated()
    data.drop_duplicates()
    data.drop_duplicates(['k1'])
    data.drop_duplicates(['k1', 'k2'], keep='last')
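The slide's example data did not survive transcription, so the calls above are shown here on a small hypothetical DataFrame with columns `k1` and `k2`. This is a sketch of how the calls behave, not the original slide output.

```python
import pandas as pd

# Hypothetical data; the last row repeats the row before it.
data = pd.DataFrame({
    "k1": ["one", "two"] * 3 + ["two"],
    "k2": [1, 1, 2, 3, 3, 4, 4],
})

dup_mask = data.duplicated()                  # True where a row repeats an earlier row
deduped = data.drop_duplicates()              # keep the first copy of each full row
by_k1 = data.drop_duplicates(["k1"])          # uniqueness judged on column k1 only
keep_last = data.drop_duplicates(["k1", "k2"], keep="last")  # keep the last copy instead

print(dup_mask.tolist())
print(keep_last)
```

With `keep='last'`, the surviving copy of the repeated row is the one at index 6 rather than index 5.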

Data Transformation: Mapping Function

    In [55]: lowercased = data['food'].str.lower()
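The slide shows only the lowercasing step. A plausible follow-up, sketched here under assumptions, is to map the normalized strings through a dictionary with `Series.map`. The food names and the `meat_to_animal` dictionary are hypothetical illustration data.

```python
import pandas as pd

# Hypothetical data with inconsistent capitalization.
data = pd.DataFrame({
    "food": ["Bacon", "pulled pork", "bacon", "Pastrami", "honey ham"],
})

# Hypothetical lookup table keyed by the lowercased food name.
meat_to_animal = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "honey ham": "pig",
}

lowercased = data["food"].str.lower()            # normalize case first
data["animal"] = lowercased.map(meat_to_animal)  # element-wise dictionary lookup

print(data)
```

Lowercasing before the lookup is what lets "Bacon" and "bacon" hit the same dictionary key; values with no matching key would come back as NaN.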


Replacing Values

    In [60]: data = pd.Series([1., -999., 2., -999., -1000., 3.])
    In [62]: data.replace(-999, np.nan)
    Out[62]:
    0       1.0
    1       NaN
    2       2.0
    3       NaN
    4   -1000.0
    5       3.0
    dtype: float64

If you want to replace multiple values at once, you instead pass a list and then the substitute value:

    In [63]: data.replace([-999, -1000], np.nan)
    Out[63]:
    0    1.0
    1    NaN
    2    2.0
    3    NaN
    4    NaN
    5    3.0
    dtype: float64
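As a runnable variant of the session above, the sketch below also shows one way to give each sentinel its own substitute by passing a dictionary to `replace`. The choice of `0` as the substitute for `-1000` is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# One substitute for several sentinel values.
same_substitute = data.replace([-999, -1000], np.nan)

# A different substitute per sentinel value (the 0 is a hypothetical choice).
per_value = data.replace({-999: np.nan, -1000: 0})

print(same_substitute.isnull().tolist())
print(per_value.tolist())
```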

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning
  - first sort the data and partition it into (equal-frequency) bins
  - then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
- Regression
  - smooth by fitting the data to regression functions
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., deal with possible outliers)

Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
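The worked example above can be reproduced with a few lines of NumPy. This is a sketch under two assumptions not stated on the slide: bin means are rounded to the nearest integer, and a value exactly halfway between the two boundaries goes to the lower one.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3

# Equal-frequency partition: sort, then cut into n_bins chunks of equal size.
# np.split requires the number of values to be divisible by n_bins.
bins = np.split(np.sort(prices), n_bins)

# Smoothing by bin means: every value becomes the (rounded) mean of its bin.
smoothed_by_means = [[int(np.rint(b.mean()))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of min and max.
smoothed_by_boundaries = [
    np.where(b - b.min() <= b.max() - b, b.min(), b.max()).tolist()
    for b in bins
]

print(smoothed_by_means)
print(smoothed_by_boundaries)
```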

Binning using Python

    In [75]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
    In [76]: bins = [18, 25, 35, 60, 100]
    In [77]: cats = pd.cut(ages, bins)
    In [78]: cats
    Out[78]:
    [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
    Length: 12
    Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

Contd..

    In [79]: cats.codes
    Out[79]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
    In [80]: cats.categories
    Out[80]:
    IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
                  closed='right',
                  dtype='interval[int64]')
    In [81]: pd.value_counts(cats)
    Out[81]:
    (18, 25]     5
    (35, 60]     3
    (25, 35]     3
    (60, 100]    1
    dtype: int64

Contd..

    In [83]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
    In [84]: pd.cut(ages, bins, labels=group_names)
    Out[84]:
    [Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
    Length: 12
    Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
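The `pd.cut` session above can be collected into one script, using the ages, bin edges, and group names from the slides. The only departure is that the counts are taken with `Series.value_counts()`, on the assumption that the top-level `pd.value_counts` may be deprecated in the installed pandas version.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Intervals are open on the left and closed on the right: (18, 25], (25, 35], ...
cats = pd.cut(ages, bins)
codes = cats.codes.tolist()             # integer bin index for each age
counts = pd.Series(cats).value_counts() # number of ages per interval

# The same binning with readable labels instead of interval notation.
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
labeled = pd.cut(ages, bins, labels=group_names)

print(codes)
print(counts)
print(list(labeled))
```

Because the intervals are closed on the right, age 25 falls into the first bin and age 61 into the last.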

Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data.

Contd..
Suppose you wanted to find values in one of the columns exceeding 3 in absolute value.
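The code for these outlier slides was not captured in the transcript, so the following is only a sketch of the idea described. The shape (1000 x 4), the seed, and the use of `clip` to cap values at ±3 are assumptions rather than the original slide code.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame of normally distributed data (fixed seed for repeatability).
rng = np.random.default_rng(12345)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Values in one column exceeding 3 in absolute value.
col = data[2]
extreme_in_col = col[col.abs() > 3]

# All rows that contain at least one such value in any column.
rows_with_extreme = data[(data.abs() > 3).any(axis=1)]

# One way to transform rather than drop outliers: cap everything to [-3, 3].
capped = data.clip(lower=-3, upper=3)

print(extreme_in_col)
print(len(rows_with_extreme))
print(capped.describe())
```

Capping keeps every row in the data set, which can be preferable to filtering when the other columns of an outlying row are still useful.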


Thank You!!