Data Preprocessing Part 1

Data Preprocessing Part 1
HAP 780 Data Mining in Health Care
Janusz Wojtusiak, PhD
George Mason University
Fall 2016

The world is full of obvious things which nobody by any chance ever observes. -Sherlock Holmes

Why Preprocessing?
- Multiple sources
- Multiple formats
- Multiple representations
- Errors, noise
- Missing values
- Unnecessary attributes
- Non-representative data
- ...and many more!

Two Types of Preprocessing
- Before loading to database/software: how to get data from multiple sources into a database, data warehouse, or other format on which DM tools can be used.
- After loading to database/software: this is what is typically covered by data preprocessing: data cleaning, transformation, reduction, discretization, normalization, ...
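Two of the steps named above, normalization and discretization, can be illustrated with a minimal sketch. The slides do not say which variants the course uses; min-max scaling and equal-width binning are assumed here, and the function names and sample ages are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value the index of one of n_bins equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [25, 30, 33, 48, 62]  # illustrative values only
normalized = min_max_normalize(ages)
bins = equal_width_bins(ages, 3)
```

Other choices (z-score normalization, equal-frequency binning) follow the same pattern and may suit skewed data better.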

Sources of Data for Data Mining
- EHR systems
- Billing
- Surveys
- Reports
- Web
- Excel spreadsheets
- Sensors
Sometimes we mine together data from multiple sources. Simply speaking, we want to be able to mine any data and all available data.

Forms of Data
- One-dimensional: signals from sensors (EKG, accelerometer, etc.)
- Two-dimensional: images
- Multidimensional: flat data tables (attribute-value pairs), relational databases, multimedia

Formats of Data: Structured
- Tables
- Relational databases
- Non-relational/NoSQL databases
- Text files (comma-separated, special formats)
- XML
- Excel files
- SAS data files
- ...

Formats of Data: Unstructured
- Text files
- Websites
- Text fields in databases/structured data
- Speech
- Multimedia
- ...

Dirty Data
- Noise
- Incompleteness
- Inconsistency

Dirty Data
PTID DOB Age Sex ProvID Dx1 Dx2 Dx3 Dx4 Dx5
1 1/2/70 48 M 345 250.0
2 30 N 010.0
2 1/1/80 33 3456 487 34 487
Patient is suffering form berculosis
5 9/8/60 F 327.0 327.2
The following records are imported after January 473.0
6 8/8/54 M 320 250.0 487 296.7 361.0 E858
7 Unknown M John Smith
8 25 F 377 150 151 038.9
How many problems are in this dataset?
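Some of the problems in a table like this one can be flagged automatically. The sketch below is a hedged illustration, not the course's method: the records are a hypothetical subset loosely modeled on the slide, and the allowed Sex codes and the month/day/year date format are assumptions.

```python
from datetime import datetime

# Hypothetical records; only four columns are kept for brevity.
records = [
    {"PTID": "1", "DOB": "1/2/70", "Age": "48", "Sex": "M"},
    {"PTID": "2", "DOB": "", "Age": "30", "Sex": "N"},
    {"PTID": "2", "DOB": "1/1/80", "Age": "33", "Sex": ""},
    {"PTID": "7", "DOB": "Unknown", "Age": "", "Sex": "M"},
]

VALID_SEX = {"M", "F"}  # assumed coding scheme


def find_problems(rows):
    """Return (row index, field, description) tuples for simple data problems."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Incompleteness: empty fields.
        for field, value in row.items():
            if value.strip() == "":
                problems.append((i, field, "missing value"))
        # Inconsistency: a code outside the expected set.
        if row["Sex"] and row["Sex"] not in VALID_SEX:
            problems.append((i, "Sex", "unexpected code"))
        # Noise: a date field that does not parse as a date.
        if row["DOB"]:
            try:
                datetime.strptime(row["DOB"], "%m/%d/%y")
            except ValueError:
                problems.append((i, "DOB", "not a date"))
        # Inconsistency: the same identifier used twice.
        if row["PTID"] in seen_ids:
            problems.append((i, "PTID", "duplicate identifier"))
        seen_ids.add(row["PTID"])
    return problems


problems = find_problems(records)
```

Checks like these catch only mechanical problems. Free text typed into a diagnosis column or an age that disagrees with the date of birth needs domain-specific rules.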

Dealing with Dirty Data
- Load data to database
  - Data types
  - Obvious problems in data files
- Data cleaning & transformation
  - Inconsistencies, missing values, sampling, attribute selection, discretization, ...
http://www.prosoftsolutions.net/blog/bid/146041/dirty-data-what-is-it-how-does-it-cause-problems-andwhat-is-the-solution

Data Types
Different names for the same thing:
- Field: used in databases
- Attribute: used in data mining and machine learning
- Variable: used in statistics
- Feature: used in machine learning (usually means a binary attribute)
Two kinds of types are discussed next:
- Database attribute types
- Analytic attribute types

Fundamental Concepts
- Symbol: a physical entity, its state, or its behavior that conveys a choice from a predefined set of choices. The choices may refer to any entities (physical or abstract objects), to their properties, or to their actions. The choice indicated by a symbol is called its meaning.
- Data: a recorded set of symbols characterizing a set of entities.
- Information: interpreted data; data whose symbols have been assigned meaning.
- Knowledge: information that is verified to be true, or true to some degree, which can be obtained by direct observation or by inference.
- Belief: hypothetical knowledge; knowledge that has not been validated, but is characterized by some measure of its relationship to the reality it describes.

[Figure: hierarchy of Belief, Knowledge, Information, Data, Symbols]

Fundamental Concepts
- Concept: a set of entities considered as a unit, and typically given a name.
- Language: a system of symbols and rules for creating expressions from these symbols for the purpose of communicating information.
- Description: an expression in some language that conveys information about a set of entities. The set being described is called the reference set. A concept description describes all entities belonging to the concept (concept instances).
- Generalization: a process of extending the reference set of a description, or its result.
- Abstraction: a process of reducing information about a reference set, or its result.

System-Specific Database Attribute Types
For example, in SQL Server 2012:

Numeric and Date

Strings and Other

Analytic Data Types
- Symbolic: symbols used to represent entities
- Numeric: numbers, usually used for calculations
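The symbolic/numeric distinction can be approximated by checking whether a column's values parse as numbers. This is only a heuristic sketch; the function name `analytic_type` and the sample columns are hypothetical.

```python
def analytic_type(values):
    """Guess whether a column of strings looks numeric or symbolic.

    Returns "numeric" if every non-empty value parses as a number,
    "symbolic" if any does not, and "unknown" if the column is empty.
    """
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "unknown"
    for v in non_empty:
        try:
            float(v)
        except ValueError:
            return "symbolic"
    return "numeric"


age_type = analytic_type(["48", "30", "", "33"])
sex_type = analytic_type(["M", "F", "M"])
dx_type = analytic_type(["250.0", "487", "E858"])
```

Parsing as a number does not make an attribute numeric in the analytic sense. A diagnosis code such as 250.0 or a provider ID is a symbol that names an entity, so a guess like this should be reviewed against what the attribute means.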

Analytic Attribute Types

Extract, Transform, Load
ETL is almost always used in the context of data warehouses, but it is also applied to data mining.
- Extract data from external sources (often many)
- Transform into a uniform representation
- Load into the target system (DW, DM)

ETL in Context
[Diagram: sources (flat files, EMR, Rx, billing, PACS) pass through Extract, Transform, and Load into a data warehouse, which feeds reporting, data mining, and analysis.]

ETL in Context
[Diagram: sources (flat files, EMR, Rx, billing, PACS) pass through Extract, Transform, and Load into a flat file ready for data mining.]
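The three ETL steps can be sketched end to end with Python's standard library. Everything specific here is an assumption made for illustration: the two in-memory sources, their column names, the sex-code mapping, and the `patient` table are hypothetical, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Two hypothetical sources with different delimiters, column names, and codings.
billing_csv = "patient_id,sex\n1,M\n2,F\n"
survey_csv = "PTID;Gender\n3;male\n4;female\n"


def extract(text, delimiter):
    """Extract: read delimited text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform(row, id_field, sex_field):
    """Transform: map one source row to the uniform (ptid, sex) representation."""
    sex_codes = {"m": "M", "male": "M", "f": "F", "female": "F"}
    # Unrecognized codes become None (NULL in the database).
    return (int(row[id_field]), sex_codes.get(row[sex_field].strip().lower()))


def load(conn, rows):
    """Load: write the uniform rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patient (ptid INTEGER PRIMARY KEY, sex TEXT)"
    )
    conn.executemany("INSERT INTO patient (ptid, sex) VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
rows = [transform(r, "patient_id", "sex") for r in extract(billing_csv, ",")]
rows += [transform(r, "PTID", "Gender") for r in extract(survey_csv, ";")]
load(conn, rows)
loaded = conn.execute("SELECT ptid, sex FROM patient ORDER BY ptid").fetchall()
```

When the target is a flat file for a data mining tool rather than a database, only `load` changes, for example by writing the same rows with `csv.writer`.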

Tools to Have
- File viewer
- Text file editor: EditPad Pro, Notepad++ (not a word processor!)
- Tools for processing very large text files: awk, sed, grep, ...
- File converters, built into software or standalone (lots of free ones)
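When awk, sed, or grep are not available, a very large text file can still be processed one line at a time so that it never has to fit in memory. The sketch below is a rough grep-like counter; the function name `grep_count` and the sample file contents are hypothetical.

```python
import os
import tempfile


def grep_count(path, needle, encoding="utf-8"):
    """Count lines containing needle, reading the file one line at a time."""
    matches = 0
    # errors="replace" keeps the scan going if the file has bad bytes.
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        for line in handle:
            if needle in line:
                matches += 1
    return matches


# Create a small sample file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("1,250.0\n2,487\n3,250.0\n")
    sample_path = tmp.name

count = grep_count(sample_path, "250.0")
os.remove(sample_path)
```

Iterating over the file object streams the file, which is why this approach scales to files that a text editor cannot open.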

HAP 780 Janusz Wojtusiak, PhD George Mason University jwojtusi@gmu.edu