Data Preprocessing Part 1 HAP 780 Data Mining in Health Care Janusz Wojtusiak, PhD George Mason University Fall 2016
The world is full of obvious things which nobody by any chance ever observes. -Sherlock Holmes
Why Preprocessing? Multiple sources; multiple formats; multiple representations; errors, noise; missing values; unnecessary attributes; non-representative data; and many more!
Two Types of Preprocessing. Before loading to database/software: how to get data from multiple sources into a database, data warehouse, or other format on which DM tools can be used. After loading to database/software: this is what is typically covered by data preprocessing, namely data cleaning, transformation, reduction, discretization, normalization, etc.
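Two of the after-loading steps named above, normalization and discretization, can be sketched in a few lines. This is a minimal illustration only: min-max scaling and equal-width binning are one common choice each, not necessarily the methods used in the course, and the `ages` list is hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based index)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [25, 30, 33, 48, 62]  # hypothetical ages, for illustration only
normalized = min_max_normalize(ages)
binned = equal_width_bins(ages, 3)
print(normalized)
print(binned)
```

Equal-width bins are easy to explain but sensitive to outliers; equal-frequency bins are a frequent alternative when the distribution is skewed.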
Sources of Data for Data Mining: EHR systems, billing, surveys, reports, web, Excel spreadsheets, sensors. Sometimes we mine together data from multiple sources. Simply put, we want to be able to mine any and all available data.
Forms of Data. One-dimensional: signals from sensors (EKG, accelerometer, etc.); two-dimensional: images; multidimensional; flat data tables (attribute-value pairs); relational databases; multimedia.
Formats of Data. Structured: tables; relational databases; non-relational/NoSQL databases; text files (comma-separated, special formats); XML; Excel files; SAS data files.
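As a small illustration of bringing two of the structured formats listed above into one representation, the sketch below reads comma-separated text and XML into the same list-of-dictionaries form using only the Python standard library. The sample data, the column names, and the `<patients>`/`<patient>` element names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data in two different formats.
CSV_TEXT = "PTID,Age,Sex\n1,48,M\n8,25,F\n"
XML_TEXT = """<patients>
  <patient><PTID>6</PTID><Age>62</Age><Sex>M</Sex></patient>
</patients>"""


def records_from_csv(text):
    """Parse comma-separated text into a list of dicts keyed by column name."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_from_xml(text):
    """Parse XML into the same list-of-dicts form, one dict per <patient>."""
    root = ET.fromstring(text)
    return [
        {child.tag: child.text for child in patient}
        for patient in root.findall("patient")
    ]


records = records_from_csv(CSV_TEXT) + records_from_xml(XML_TEXT)
for record in records:
    print(record)
```

All values arrive as strings here; deciding which attributes are numeric and which are symbolic is a separate step.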
Formats of Data. Unstructured: text files; websites; text fields in databases/structured data; speech; multimedia.
Dirty Data: noise, incompleteness, inconsistency.
Dirty Data
PTID DOB Age Sex ProvID Dx1 Dx2 Dx3 Dx4 Dx5
1 1/2/70 48 M 345 250.0
2 30 N 010.0
2 1/1/80 33 3456 487 34 487
Patient is suffering form berculosis
5 9/8/60 F 327.0 327.2
The following records are imported after January 473.0
6 8/8/54 M 320 250.0 487 296.7 361.0 E858
7 Unknown M John Smith
8 25 F 377 150 151 038.9
How many problems are in this dataset?
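Some of the problems in a table like this can be flagged automatically. The sketch below is a minimal example on a handful of hypothetical records loosely modeled on the slide; the set of valid sex codes and the specific checks are assumptions, and a real dataset would need many more rules.

```python
from collections import Counter

# Hypothetical records, loosely modeled on the example table.
records = [
    {"PTID": "1", "Age": "48", "Sex": "M"},
    {"PTID": "2", "Age": "30", "Sex": "N"},        # invalid sex code
    {"PTID": "2", "Age": "33", "Sex": ""},         # duplicate ID, missing sex
    {"PTID": "7", "Age": "Unknown", "Sex": "M"},   # non-numeric age
]

VALID_SEX = {"M", "F"}  # assumed coding scheme


def find_problems(rows):
    """Return (row index, description) pairs for a few simple data problems."""
    problems = []
    id_counts = Counter(row["PTID"] for row in rows)
    for i, row in enumerate(rows):
        if id_counts[row["PTID"]] > 1:
            problems.append((i, "duplicate PTID"))
        for field, value in row.items():
            if value.strip() == "":
                problems.append((i, f"missing {field}"))
        if row["Sex"] and row["Sex"] not in VALID_SEX:
            problems.append((i, "invalid Sex code"))
        if row["Age"] and not row["Age"].isdigit():
            problems.append((i, "non-numeric Age"))
    return problems


problems = find_problems(records)
for index, description in problems:
    print(index, description)
```

Checks like these only find problems that can be stated as rules; free text in a diagnosis column or a comment row inside the data still requires looking at the file.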
Dealing with Dirty Data. Load data to database: data types; obvious problems in data files. Data cleaning & transformation: inconsistencies, missing values, sampling, attribute selection, discretization, ... http://www.prosoftsolutions.net/blog/bid/146041/dirty-data-what-is-it-how-does-it-cause-problems-andwhat-is-the-solution
Data Types. Different names for the same concept. Field: used in databases. Attribute: used in data mining and machine learning. Variable: used in statistics. Feature: used in machine learning (usually means a binary attribute). Database Attribute Types. Analytic Attribute Types.
Fundamental Concepts. Symbol: a physical entity, its state, or its behavior that conveys a choice from a predefined set of choices. The choices may refer to any entities (physical or abstract objects), to their properties, or to their actions. The choice indicated by a symbol is called its meaning. Data: a recorded set of symbols characterizing a set of entities. Information: interpreted data; data whose symbols have been assigned meaning. Knowledge: information that is verified to be true or true to some degree, which can be obtained by direct observation or by inference. Belief: hypothetical knowledge; knowledge that has not been validated, but is characterized by some measure of its relationship to the reality it describes.
Belief Knowledge Information Data Symbols
Fundamental Concepts. Concept: a set of entities considered as a unit, and typically given a name. Language: a system of symbols and rules for creating expressions from these symbols for the purpose of communicating information. Description: an expression in some language that conveys information about a set of entities. The set being described is called the reference set. A concept description describes all entities belonging to the concept (concept instances). Generalization: a process of extending the reference set of a description, or its result. Abstraction: a process of reducing information about a reference set, or its result.
System Specific Database Attribute Types For example, in SQL Server 2012:
Numeric and Date
Strings and Other
Analytic Data Types. Symbolic: symbols used to represent entities. Numeric: numbers, usually used for calculations.
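The distinction matters in practice because some symbolic values look like numbers. The short sketch below, using diagnosis-code-like strings as an assumed example, shows what is lost when symbols are converted to a numeric type.

```python
# Code-like strings that look numeric but act as symbols (labels for entities).
dx_codes = ["250.0", "038.9", "487"]

# Treating them as numbers silently changes them.
as_numbers = [float(code) for code in dx_codes]
round_trip = [str(number) for number in as_numbers]

print(round_trip)
print(round_trip == dx_codes)
```

After the round trip, the leading zero of "038.9" is gone and "487" has gained a decimal part, so the values no longer match the original codes. Keeping such attributes as text avoids this.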
Analytic Attribute Types
Extract, Transform, Load. ETL is almost always used in the context of data warehouses, but it also applies to data mining. Extract: data from external sources (often many). Transform: into a uniform representation. Load: into the target system (DW, DM).
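The three ETL steps can be sketched end to end with the Python standard library. Everything specific here is an assumption made for illustration: the two source extracts, their column names and delimiters, the sex-code mapping, and the `visits` target table in an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical extracts from two sources with different layouts and codings.
BILLING_CSV = "ptid,sex,dx\n1,M,250.0\n8,f,038.9\n"
EMR_CSV = "PatientID;Gender;Diagnosis\n6;Male;487\n"

SEX_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}  # assumed mapping


def extract(text, delimiter):
    """Extract: read delimited text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform(row, mapping):
    """Transform: rename source columns and unify the sex coding."""
    out = {target: row[source] for source, target in mapping.items()}
    out["sex"] = SEX_CODES.get(out["sex"].strip().lower())  # None if unrecognized
    out["ptid"] = int(out["ptid"])
    return out


def load(conn, rows):
    """Load: insert the uniform rows into the target table."""
    conn.executemany(
        "INSERT INTO visits (ptid, sex, dx) VALUES (:ptid, :sex, :dx)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (ptid INTEGER, sex TEXT, dx TEXT)")

billing = [
    transform(r, {"ptid": "ptid", "sex": "sex", "dx": "dx"})
    for r in extract(BILLING_CSV, ",")
]
emr = [
    transform(r, {"PatientID": "ptid", "Gender": "sex", "Diagnosis": "dx"})
    for r in extract(EMR_CSV, ";")
]
load(conn, billing + emr)

rows = conn.execute("SELECT ptid, sex, dx FROM visits ORDER BY ptid").fetchall()
print(rows)
```

The diagnosis column is stored as text so that codes keep their leading zeros; the per-source column mapping is what makes the transform step reusable across sources.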
ETL in Context [diagram: sources (Flat Files, EMR, Rx, Billing, PACS); Extract, Transform, Load; Data Warehouse; uses (Reporting, Data Mining, Analysis)]
ETL in Context [diagram: sources (Flat Files, EMR, Rx, Billing, PACS); Extract, Transform, Load; flat file ready for data mining; Data Mining]
Tools to Have. File viewer: a text file editor (EditPad Pro, Notepad++), not a word processor! Processing very large text files: awk, sed, grep, ... File converters: built into software or standalone; lots of free ones.
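When awk, sed, or grep are not available, the same streaming idea works in a short script: read one line at a time so the whole file never has to fit in memory. The sketch below is a minimal grep-style line counter; the sample file contents and the search pattern are hypothetical.

```python
import os
import re
import tempfile


def grep_count(path, pattern):
    """Count lines matching a regular expression, reading one line at a time."""
    regex = re.compile(pattern)
    count = 0
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:  # the file object yields lines lazily
            if regex.search(line):
                count += 1
    return count


# Write a tiny hypothetical file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("1,M,250.0\n6,M,487\n8,F,250.0\n")
    sample_path = tmp.name

matches = grep_count(sample_path, r",250\.0$")
os.remove(sample_path)
print(matches)
```

Opening the file with `errors="replace"` keeps the scan going when a large export contains a few badly encoded bytes, which is common in files assembled from several systems.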
HAP 780 Janusz Wojtusiak, PhD George Mason University jwojtusi@gmu.edu