Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Similar documents
COMP 465 Special Topics: Data Mining

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Knowledge Discovery and Data Mining

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

CSE-4412: Data Mining

Winter Semester 2009/10 Free University of Bozen, Bolzano

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

Data warehouse and Data Mining

Data Mining: Concepts and Techniques

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

Introduction to Data Mining

CSE5243 INTRO. TO DATA MINING

DATA MINING II - 1DL460

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

cse643 Data Mining Professor Anita Wasilewska Computer Science Department Stony Brook University

Chapter 1, Introduction

DATA MINING II - 1DL460

Question Bank. 4) It is the source of information later delivered to data marts.

Concepts and Techniques. Data Mining: Slides related to: University of Illinois at Urbana-Champaign

What is Data Mining?

cse643, cse590 Data Mining Anita Wasilewska Computer Science Department Stony Brook University Stony Brook NY 11794

Data Mining Course Overview

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

Data Mining and Analytics. Introduction

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Big Data Analytics The Data Mining process. Roger Bohn March. 2016

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

Summary of Last Class. Course Content. Chapter 1 Objectives

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

D B M G Data Base and Data Mining Group of Politecnico di Torino

INTRODUCTION TO DATA MINING

1. Inroduction to Data Mininig

Applications and Trends in Data Mining

Table Of Contents: xix Foreword to Second Edition

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

Data Mining Technology Based on Bayesian Network Structure Applied in Learning

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

Contents. Foreword to Second Edition. Acknowledgments About the Authors

CS249: ADVANCED DATA MINING

Knowledge Modelling and Management. Part B (9)

Data mining fundamentals

CS 412 Intro. to Data Mining

Data Mining. Prof. Jiawei Han of UIUC

Knowledge Discovery in Data Bases

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing.

An Introduction to Data Mining BY:GAGAN DEEP KAUSHAL

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

UNIT-IV DATA MINING ALGORITHMS. March 14, 2012 Prof. Asha Ambhaikar

Knowledge Discovery & Data Mining

CT75 (ALCCS) DATA WAREHOUSING AND DATA MINING JUN

ISM 50 - Business Information Systems

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

9. Conclusions. 9.1 Definition KDD

Computers Are Your Future

Data Mining and Warehousing

Data Mining & Machine Learning

Data Mining. Vera Goebel. Department of Informatics, University of Oslo

Introduction to Data Mining and Data Analytics

DATA WAREHOUSING AND MINING UNIT-V TWO MARK QUESTIONS WITH ANSWERS

CT75 DATA WAREHOUSING AND DATA MINING DEC 2015

DATA MINING AND WAREHOUSING

Study on the Application Analysis and Future Development of Data Mining Technology

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

DATA WAREHOUING UNIT I

Fig 1.2: Relationship between DW, ODS and OLTP Systems

Chapter 1 Introduction to Data Mining

Big Data Analytics The Data Mining process. Roger Bohn March. 2017

General introduction. Marco Saerens (UCL), with Christine Decaestecker (ULB)

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

DATA MINING TRANSACTION

SCHEME OF COURSE WORK. Data Warehousing and Data mining

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

Research on Data Mining Technology Based on Business Intelligence. Yang WANG

Time: 3 hours. Full Marks: 70. The figures in the margin indicate full marks. Answers from all the Groups as directed. Group A.

Data Mining and Data Warehousing Introduction to Data Mining

A SURVEY OF DATA MINING & ITS APPLICATIONS

The 2018 (14th) International Conference on Data Science (ICDATA)

CS590D: Data Mining Chris Clifton. What Is Data Mining?

An Improved Apriori Algorithm for Association Rules

Full file at

Data Mining Concept. References. Why Mine Data? Commercial Viewpoint. Why Mine Data? Scientific Viewpoint

Introduction. to Data Mining. Introduction. Motivation: Necessity is the Mother of Invention. Motivation: Necessity is the Mother of Invention

SIDDHARTH GROUP OF INSTITUTIONS :: PUTTUR Siddharth Nagar, Narayanavanam Road QUESTION BANK (DESCRIPTIVE)

1.1 What Motivated Data Mining? Why Is It Important?

Data Mining Concepts

Chapter 3: Data Mining:

Data Mining and Knowledge Management Process

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Oracle9i Data Mining. Data Sheet August 2002

An Indian Journal FULL PAPER. Trade Science Inc. Research on data mining clustering algorithm in cloud computing environments ABSTRACT KEYWORDS

Introduction. IST557 Data Mining: Techniques and Applications. Jessie Li, Penn State University

Lecture 18. Business Intelligence and Data Warehousing. 1:M Normalization. M:M Normalization 11/1/2017. Topics Covered

Data warehouses Decision support The multidimensional model OLAP queries

Code No: R Set No. 1

Transcription:

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Books 2 Which Chapter from which Text Book? Chapter 1: Introduction from Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3nd Edition

Course objective and Course Outcome Course objective 1: To introduce the concept of data Mining as an important tool for enterprise data management and as a cutting edge technology for building competitive advantage. Course Outcome TEITC604.1: Demonstrate an understanding of the importance of data mining and the principles of business intelligence 3

Topics to be covered What is Data Mining; Kind of patterns to be mined; Technologies used; Major issues in Data Mining 4

Evolution of database system technology 5

Data Warehouse Definition A data warehouse is a subject-oriented integrated time-varying non-volatile collection of data that is used primarily in organizational decision making. 6 -- Bill Inmon, Building the Data Warehouse 1996

What is a data warehouse? 7

Example of a datawarehouse framework used by an electronics store 8

9

Why Data Mining? The Explosive Growth of Data: from terabytes to petabytes Data collection and data availability Automated data collection tools, database systems, Web, computerized society Major sources of abundant data Business: Web, e-commerce, transactions, stocks, Science: Remote sensing, bioinformatics, scientific simulation, Society and everyone: news, digital cameras, YouTube We are drowning in data, but starving for knowledge! Necessity is the mother of invention Data mining Automated analysis of massive data sets 10

Why Data Mining? Potential Applications Data analysis and decision support 11 Market analysis and management Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation Risk analysis and management Forecasting, customer retention, improved underwriting, quality control, competitive analysis Fraud detection and detection of unusual patterns (outliers) Other Applications Text mining (news group, email, documents) and Web mining Stream data mining Bioinformatics and bio-data analysis

Ex. 1: Market Analysis and Management Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies Target marketing Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc. 12 Determine customer purchasing patterns over time Cross-market analysis Find associations/co-relations between product sales, & predict based on such association Customer profiling What types of customers buy what products (clustering or classification) Customer requirement analysis Identify the best products for different groups of customers Predict what factors will attract new customers Provision of summary information Multidimensional summary reports Statistical summary information (data central tendency and variation)

Ex. 2: Corporate Analysis & Risk Management Finance planning and asset evaluation cash flow analysis and prediction contingent claim analysis to evaluate assets cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) Resource planning summarize and compare the resources and spending Competition monitor competitors and market directions group customers into classes and a class-based pricing procedure set pricing strategy in a highly competitive market 13

Ex. 3: Fraud Detection & Mining Unusual Patterns Approaches: Clustering & model construction for frauds, outlier analysis Applications: Health care, retail, credit card service, telecomm. Auto insurance: ring of collisions Money laundering: suspicious monetary transactions Medical insurance Professional patients, ring of doctors, and ring of references 14 Unnecessary or correlated screening tests Telecommunications: phone-call fraud Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm Retail industry Analysts estimate that 38% of retail shrink is due to dishonest employees Anti-terrorism

What Is Data Mining? Data mining (knowledge discovery from data) 15 Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data Data mining: a misnomer? Alternative names Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Watch out: Is everything data mining? Simple search and query processing (Deductive) expert systems

Examples: What is (not) Data Mining? What is not Data Mining? Look up phone number in phone directory Query a Web search engine for information about Amazon What is Data Mining? Certain names are more prevalent in certain Mumbai locations (D Souza, D Mello, D Silva in Vasai area) Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,)

KDD Process: Several Key Steps Learning the application domain relevant prior knowledge and goals of application 17 Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation Find useful features, dimensionality/variable reduction, invariant representation Choosing functions of data mining summarization, classification, regression, association, clustering Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge

Knowledge Discovery (KDD) Process This is a view from typical database systems and data Pattern Evaluation warehousing communities Data mining plays an essential role in the knowledge discovery process Data Mining 18 Task-relevant Data Data Warehouse Selection and transformation Data Cleaning Data Integration Databases

Example: A Web Mining Framework Web mining usually involves Data cleaning Data integration from multiple sources Warehousing the data Data cube construction Data selection for data mining Data mining Presentation of the mining results 19 Patterns and knowledge to be used or stored into knowledgebase

Data Mining in Business Intelligence 20 Increasing potential to support business decisions Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery End User Business Analyst Data Analyst Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems DBA

KDD Process: A Typical View from ML and Statistics 21 Input Data Data Pre- Processing Data Mining Post- Processing Data integration Normalization Feature selection Dimension reduction Pattern discovery Association & correlation Classification Clustering Outlier analysis Pattern evaluation Pattern selection Pattern interpretation Pattern visualization This is a view from typical machine learning and statistics communities

Decisions in Data Mining(1) Data to be mined 22 Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks Knowledge to be mined (or: Data mining functions) Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. Descriptive vs. predictive data mining : Descriptive mining tasks characterize properties of data in a target data set & predictive mining tasks perform induction on current data in order to make predictions Multiple/integrated functions and mining at multiple levels

Decisions in Data Mining(2) Techniques utilized 23 Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc. Applications adapted Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

What Kinds of Patterns Can Be Mined? 1. Generalization 2. Association and Correlation Analysis 3. Classification 4. Cluster Analysis 5. Outlier Analysis 24

Data Mining Function: (1) Generalization Multidimensional concept description: Characterization and discrimination 25 Generalize, summarize, and contrast data characteristics, e.g., summarize the characteristics of customers who spend more than Rs. 50,000 a year at an electronics store Data characterization is a summarization of the general characteristics or features of a target class of data Data cube technology for computing Scalable methods for computing (i.e., materializing) multidimensional aggregates OLAP (online analytical processing) Examples of Output forms : pie charts, MDD cubes, bar charts, curves etc.

Data Mining Function: (1) Generalization contd. 26 Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes Eg. Compare 2 groups of customers- those who shop for computer products regularly(more than twice a month) and those who rarely shop for such products(less than 3 times a year) Data cube technology for computing Drill down on any dimension Discriminant rules: Discrimination descriptions expressed in the form of rules Output forms : same as that of data characterization along with discrimination descriptions

Data Mining Function: (2) Association and Correlation Analysis Frequent patterns (or frequent itemsets) 27 What items are frequently purchased together in your mart? Eg. Milk & bread Association, correlation vs. causality A typical association rule Computer software [1%, 50%] (support, confidence) Confidence means that if one buys a computer there is a 50% chance that she will buy software too. A 1% support means that 1% of all transactions under analysis show that computer & software are purchased together Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold

Data Mining Function: (3) Classification Classification and label prediction Construct models (functions) based on some training examples 28 Describe and distinguish classes or concepts for future prediction E.g., classify countries based on (climate), or classify cars based on (gas mileage) Predict some unknown class labels Typical methods Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, Typical applications: Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages,

Various forms of a classification model 29

Data Mining Function: (4) Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns 30 Data objects are clustered or grouped based on the principle of maximizing intraclass similarity and minimizing interclass similarity Many methods and applications

A 2D plot of customer data with respect to customer locations in a city showing 3 data clusters 31

Data Mining Function: (5) Outlier Analysis Outlier analysis (anomaly mining) 32 Outlier: A data object that does not comply with the general behavior of the data Noise or exception? One person s garbage could be another person s treasure Methods: by product of clustering or regression analysis, Useful in fraud detection, rare events analysis

Are All the Discovered Patterns Interesting? Data mining may generate thousands of patterns: Not all of them are interesting Suggested approach: Human-centered, query-based, focused mining Interestingness measures 33 A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm Objective vs. subjective interestingness measures Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. Subjective: based on user s belief in the data, e.g., unexpectedness, novelty, actionability, etc. Eg. A large Earthquake often follows a cluster of small quakes

Find All and Only Interesting Patterns? 34 Find all the interesting patterns: Completeness Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns? Heuristic vs. exhaustive search Association vs. classification vs. clustering Search for only interesting patterns: An optimization problem Can a data mining system find only the interesting patterns? Approaches First generate all the patterns and then filter out the uninteresting ones Generate only the interesting patterns mining query optimization

Data Mining: Confluence of Multiple Disciplines 35 Machine Learning Pattern Recognition Statistics Applications Data Mining Visualization Algorithm Database Technology High-Performance Computing

Why Confluence of Multiple Disciplines? Tremendous amount of data 36 Algorithms must be highly scalable to handle such as tera-bytes of data High-dimensionality of data Micro-array may have tens of thousands of dimensions High complexity of data Data streams and sensor data Time-series data, temporal data, sequence data Structure data, graphs, social networks and multi-linked data Heterogeneous databases and legacy databases Spatial, spatiotemporal, multimedia, text and Web data Software programs, scientific simulations New and sophisticated applications

Applications of Data Mining 37 Web page analysis: from web page classification, clustering to PageRank & HITS algorithms Collaborative analysis & recommender systems Basket data analysis to targeted marketing Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue) From major dedicated data mining systems/tools (e.g., SAS, MS SQL- Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining

Major Issues in Data Mining (1) Mining Methodology Mining various and new kinds of knowledge Mining knowledge in multi-dimensional space Data mining: An interdisciplinary effort Boosting the power of discovery in a networked environment Handling noise, uncertainty, and incompleteness of data Pattern evaluation and pattern- or constraint-guided mining User Interaction Interactive mining Incorporation of background knowledge Presentation and visualization of data mining results 38

Major Issues in Data Mining (2) Efficiency and Scalability Efficiency and scalability of data mining algorithms Parallel, distributed, stream, and incremental mining methods Diversity of data types Handling complex types of data Mining dynamic, networked, and global data repositories Data mining and society Social impacts of data mining Privacy-preserving data mining Invisible data mining 39

Summary Data mining: Discovering interesting patterns and knowledge from 40 massive amount of data A natural evolution of database technology, in great demand, with wide applications A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation Mining can be performed in a variety of data Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. Data mining technologies and applications Major issues in data mining