Data warehouse and Data Mining

Lecture No. 14: Data Mining and its techniques
Naeem A. Mahoto
Email: naeemmahoto@gmail.com
Department of Software Engineering
Mehran University of Engineering and Technology, Jamshoro

Decision support progress to Data Mining
System | Data | Level of decision support
Early file-based systems | Basic accounting data | No decision support
Database systems | Operational systems data | Primitive decision support
Data warehouse | Data for decision support | True decision support
OLAP systems | Data for multidimensional analysis | Complex analysis & calculations
Data mining applications | Selected and extracted data | Knowledge discovery

Data Mining
- A non-trivial extraction of novel, implicit, and actionable knowledge from large databases
- Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind

Data Mining: A KDD Process
[Figure: the KDD process as a flow] Databases → Data cleaning and data integration → Data warehouse → Selection → Task-relevant data → Data mining → Pattern evaluation

Steps of KDD Process
1. Learning the application domain: relevant prior knowledge and goals of the application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of the effort!)
4. Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
5. Choosing the functions of data mining: summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge

Data Mining and Business Intelligence
Layers, listed from highest to lowest potential to support business decisions (typical role in parentheses):
- Making decisions (End User)
- Data presentation: visualization techniques (Business Analyst)
- Data mining: information discovery (Data Analyst)
- Data exploration: statistical analysis, querying and reporting (Data Analyst)
- Data warehouses / data marts: OLAP, MDA (DBA)
- Data sources: paper, files, information providers, database systems, OLTP (DBA)

OLAP versus Data Mining
OLAP
- In an OLAP analysis session, the analyst starts with some prior knowledge of what to look for
- OLAP helps the user analyze the past and gain insights
- The analyst drives the process while using OLAP tools
- Complex queries
Data Mining
- The analyst has no prior knowledge of what the results are likely to be
- Data mining helps the user predict the future
- The analyst prepares the data and sits back while the tools drive the process
- No SQL queries

OLAP versus Data Mining
Feature | OLAP | Data Mining
Motivation for information request | What is happening in the enterprise? | Predict the future based on why this is happening
Data granularity | Summary data | Detailed transaction-level data
Number of business dimensions | Limited number of dimensions | Large number of dimensions
Number of dimension attributes | Small number of attributes | Many dimension attributes
Sizes of datasets for the dimensions | Not large for each dimension | Usually very large for each dimension
Analysis approach | User-driven interactive analysis | Data-driven automatic knowledge discovery
Analysis techniques | Multidimensional, drill-down, and slice & dice | Prepare data, launch mining tool & sit back
State of the technology | Mature & widely used | Still emerging

Data Mining Applications
- Database analysis and decision support
  - Market analysis and management: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (newsgroups, email, documents)
  - Stream data mining
  - Web mining
  - DNA data analysis

Data Mining Techniques
Data mining covers a broad range of techniques, including:
- Classification
- Clustering
- Sequential pattern mining
- Association rule mining
- Many more
Each technique is realized by specific algorithms.

Association Rule Mining
- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
- Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
- Motivation: finding regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can one automatically classify web documents?

Association Rule Mining
- Itemset X = {x1, ..., xk}
- Find all the rules X → Y with minimum confidence and support
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y
Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F
Let min_support = 50%, min_conf = 50%:
- A → C (support 50%, confidence 66.7%)
- C → A (support 50%, confidence 100%)
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
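The two rule measures can be written out and checked against the four transactions on the slide. This is a sketch in standard notation; the symbol D for the set of all transactions is an assumption introduced here, not a name used on the slide.

```latex
\[
\mathrm{support}(X \Rightarrow Y) = P(X \cup Y)
  = \frac{\lvert \{\, t \in D : X \cup Y \subseteq t \,\} \rvert}{\lvert D \rvert},
\qquad
\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X)
  = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
\]
% Transactions 10 and 20 contain both A and C; transactions 10, 20, 30 contain A.
\[
\mathrm{support}(A \Rightarrow C) = \frac{2}{4} = 50\%,
\qquad
\mathrm{confidence}(A \Rightarrow C) = \frac{2/4}{3/4} = \frac{2}{3} \approx 66.7\%,
\qquad
\mathrm{confidence}(C \Rightarrow A) = \frac{2/4}{2/4} = 100\%
\]
```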

Mining Association Rules: an Example
Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F
Min. support 50%, min. confidence 50%
Frequent pattern | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%
For example, for the rule A → C:
- support = support({A} ∪ {C}) = 50%
- confidence = support({A} ∪ {C}) / support({A}) = 66.7%
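The example above can be reproduced with a short brute-force sketch in Python. It enumerates every candidate itemset, which is only practical for a toy database like this one; it is not Apriori or any other optimized algorithm. The names `support`, `frequent_itemsets`, and `rules` are illustrative choices, not taken from the lecture.

```python
from itertools import combinations

# The four transactions from the slide.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]
MIN_SUPPORT = 0.5
MIN_CONFIDENCE = 0.5


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets():
    """Brute force: test every non-empty combination of items."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo))
            if s >= MIN_SUPPORT:
                result[frozenset(combo)] = s
    return result


def rules(frequent):
    """Split each frequent itemset into lhs -> rhs and keep confident rules."""
    found = []
    for itemset, s in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already stored in `frequent`.
                conf = s / frequent[lhs]
                if conf >= MIN_CONFIDENCE:
                    found.append((set(lhs), set(itemset - lhs), s, conf))
    return found


frequent = frequent_itemsets()
for lhs, rhs, s, conf in rules(frequent):
    print(f"{sorted(lhs)} -> {sorted(rhs)}: support={s:.1%}, confidence={conf:.1%}")
```

Only {A, C} has more than one item, so the two rules printed are A → C and C → A.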

Classification and Prediction
- Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
- Presentation: decision tree, classification rule, neural network
- Prediction: predict some unknown or missing numerical values

Classification Process: Model Construction
Training data → classification algorithms → classifier (model)
Training data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no
Classifier (model): IF rank = professor OR years > 6 THEN tenured = yes

Classification Process: Use the Model in Prediction
The classifier is first checked against testing data, then applied to unseen data.
Testing data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes
Unseen data: (Jeff, Professor, 4). Tenured?
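As a minimal sketch, the rule learned during model construction (IF rank = professor OR years > 6 THEN tenured = yes) can be written as a Python function, scored on the testing data, and applied to the unseen record. The function name `predict_tenured` and the tuple layout are illustrative assumptions.

```python
def predict_tenured(rank, years):
    """Classifier from the model-construction slide."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Testing data from the slide: (name, rank, years, actual tenured label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy = share of test records whose predicted label matches the actual one.
correct = sum(
    1
    for _name, rank, years, actual in testing_data
    if predict_tenured(rank, years) == actual
)
accuracy = correct / len(testing_data)

# Unseen record: (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)

print(f"accuracy on testing data: {accuracy:.0%}")
print(f"Jeff tenured? {jeff}")
```

The rule misclassifies Merlisa (7 years, but not tenured), so it gets 3 of the 4 test records right, and it predicts "yes" for Jeff because his rank is Professor.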

Decision Trees
Training set:
age | income | student | credit_rating
<=30 | high | no | fair
<=30 | high | no | excellent
31..40 | high | no | fair
>40 | medium | no | fair
>40 | low | yes | fair
>40 | low | yes | excellent
31..40 | low | yes | excellent
<=30 | medium | no | fair
<=30 | low | yes | fair
>40 | medium | yes | fair
<=30 | medium | yes | excellent
31..40 | medium | no | excellent
31..40 | high | yes | fair
>40 | medium | no | excellent

Decision Trees
age?
- <=30: student?
  - no: no
  - yes: yes
- 31..40: yes
- >40: credit rating?
  - excellent: no
  - fair: yes
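A decision tree is equivalent to nested if/else tests, one per internal node. The sketch below encodes the tree from the slide under two assumptions: the middle age branch is 31..40, matching the training set, and under credit rating the "fair" branch leads to "yes" while "excellent" leads to "no". The function name `classify` is illustrative, and the leaves are the predicted yes/no class.

```python
def classify(age, student, credit_rating):
    """Walk the tree from the root (age) down to a yes/no leaf.

    `age` is one of the three bands used in the training set.
    Income is not tested anywhere in this tree, so it is not a parameter.
    """
    if age == "<=30":
        # Second-level test for the youngest band: student?
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        # This branch is a leaf.
        return "yes"
    if age == ">40":
        # Second-level test for the oldest band: credit rating?
        return "yes" if credit_rating == "fair" else "no"
    raise ValueError(f"unknown age band: {age!r}")


print(classify("<=30", "no", "fair"))
print(classify("31..40", "no", "excellent"))
print(classify(">40", "yes", "excellent"))
```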

Cluster and outlier analysis
Cluster Analysis
- Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
- Clustering is based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity
Outlier Analysis
- Outlier: a data object that does not comply with the general behavior of the data
- It can be considered noise or an exception, but it is quite useful in fraud detection and rare events analysis
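To make the two ideas concrete, here is a small one-dimensional sketch. It groups sorted values wherever consecutive values are close together, which amounts to single-linkage clustering with a distance cut-off, and treats any value left alone in its group as an outlier. The house prices, the `max_gap` threshold, and the function name `gap_clusters` are all hypothetical, chosen only for illustration.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; start a new group when the gap exceeds max_gap.

    Assumes `values` is non-empty.
    """
    ordered = sorted(values)
    groups = [[ordered[0]]]
    for value in ordered[1:]:
        if value - groups[-1][-1] <= max_gap:
            groups[-1].append(value)   # close to the previous value: same group
        else:
            groups.append([value])     # large gap: start a new group
    return groups


# Hypothetical house prices (in thousands); not data from the lecture.
prices = [100, 105, 110, 300, 310, 320, 900]

groups = gap_clusters(prices, max_gap=50)
clusters = [g for g in groups if len(g) > 1]
outliers = [g[0] for g in groups if len(g) == 1]

print("clusters:", clusters)
print("outliers:", outliers)
```

Values inside a group are similar to each other and dissimilar to the other groups, while the isolated value does not follow the general behavior of the data.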

Clusters and Outliers
[Figure: data points with the clusters and the outliers labeled]

Sequential Pattern Mining
- Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns in a sequence database
- A sequence database stores a number of records, where all records are sequences of ordered events, with or without concrete notions of time
- Sequential patterns are used for targeted marketing and customer retention

Terminology for Sequence Mining
- Itemset: a non-empty set of items
- Sequence: an ordered list of itemsets
- Customer sequence: the list of a customer's transactions ordered by increasing transaction time
- A customer supports a sequence if the sequence is contained in the customer sequence
- Support for a sequence: the fraction of total customers that support the sequence
- Maximal sequence: a sequence that is not contained in any other sequence
- Closed sequence: a frequent sequence that has no super-sequence with the same support

Example: Sequence
A sequence database:
SID | Sequence
10 | <a(abc)(ac)d(cf)>
20 | <(ad)c(bc)(ae)>
30 | <(ef)(ab)(df)cb>
40 | <eg(af)cbc>
- A sequence: <(ef)(ab)(df)cb>
- An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
- <a(bc)df> is a subsequence of <a(abc)(ac)d(cf)>
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
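Both claims in this example can be checked with a short Python sketch. Each sequence is modeled as a list of sets, one set per element, and a pattern is contained in a sequence if its elements can be matched, in order, to supersets among the sequence's elements. The function name `is_subsequence` and the data layout are illustrative assumptions.

```python
def is_subsequence(pattern, sequence):
    """True if each pattern element is a subset of a distinct, later element."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest remaining element that contains it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True


# The sequence database from the slide, keyed by SID.
database = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}

# <a(bc)df> against the sequence with SID 10.
contained = is_subsequence([{"a"}, {"b", "c"}, {"d"}, {"f"}], database[10])

# Support count of <(ab)c>: number of sequences that contain it.
pattern = [{"a", "b"}, {"c"}]
supporters = [sid for sid, seq in database.items() if is_subsequence(pattern, seq)]

print("<a(bc)df> contained in SID 10:", contained)
print("<(ab)c> supported by SIDs:", supporters)
```

The pattern <(ab)c> is supported by sequences 10 and 30, so its support count of 2 meets min_sup = 2.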

Terms
- Data scrubbing: a process to upgrade the quality of data before it is moved into a data warehouse
- Transient data: data in which changes to existing records cause the previous version of the records to be eliminated