Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

Similar documents
Knowledge Discovery and Data Mining

INTRODUCTION TO DATA MINING

Statistical Learning and Data Mining CS 363D/ SSC 358

Data Mining Course Overview

Foundation of Data Mining: Introduction

An Introduction to Data Mining BY:GAGAN DEEP KAUSHAL

Chapter 1, Introduction

DATA MINING II - 1DL460

Data Mining Concept. References. Why Mine Data? Commercial Viewpoint. Why Mine Data? Scientific Viewpoint

DATA MINING II - 1DL460

COMP90049 Knowledge Technologies

COMP 465 Special Topics: Data Mining

Stats Overview Ji Zhu, Michigan Statistics 1. Overview. Ji Zhu 445C West Hall

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

Winter Semester 2009/10 Free University of Bozen, Bolzano

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

CISC 4631 Data Mining Lecture 01:

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

CSE-4412: Data Mining

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

Data Mining: Introduction. Lecture Notes for Chapter 1. Introduction to Data Mining

D B M G Data Base and Data Mining Group of Politecnico di Torino

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

Introduction to Data Mining CS 584 Data Mining (Fall 2016)

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Data Mining & Feature Selection

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

Data mining fundamentals

Data warehouse and Data Mining

1. Inroduction to Data Mininig

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Includes Review of Syllabus OVERVIEW OF THE CLASS

CSE5243 INTRO. TO DATA MINING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Data Mining Concepts

Data Mining and Analytics. Introduction

Chapter 3: Data Mining:

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

9. Conclusions. 9.1 Definition KDD

Data Mining: Concepts and Techniques

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

Dynamic Data in terms of Data Mining Streams

Data Mining & Machine Learning

Question Bank. 4) It is the source of information later delivered to data marts.

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

Introduction to Data Mining. Komate AMPHAWAN

An Improved Apriori Algorithm for Association Rules

Knowledge Modelling and Management. Part B (9)

Product presentations can be more intelligently planned

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

DATA WAREHOUING UNIT I

Overview of Web Mining Techniques and its Application towards Web

Introduction to Data Mining

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Vera Goebel. Department of Informatics, University of Oslo

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Summary of Last Class. Course Content. Chapter 1 Objectives

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

Code No: R Set No. 1

Knowledge Discovery & Data Mining

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems

Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

CS249: ADVANCED DATA MINING

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

INTRODUCTION TO DATA MINING ASSOCIATION RULES. Luiza Antonie

Data Mining and Data Warehousing Introduction to Data Mining

ISM 50 - Business Information Systems

Applications and Trends in Data Mining

Introduction to Data Mining and Data Analytics

Knowledge Discovery in Data Bases

745: Advanced Database Systems

DATA MINING TRANSACTION

Table Of Contents: xix Foreword to Second Edition

cse643 Data Mining Professor Anita Wasilewska Computer Science Department Stony Brook University

Data Mining Technology Based on Bayesian Network Structure Applied in Learning

Big Data Analytics The Data Mining process. Roger Bohn March. 2016

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)?

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

CT75 DATA WAREHOUSING AND DATA MINING DEC 2015

CS 412 Intro. to Data Mining

Clustering Analysis based on Data Mining Applications Xuedong Fan

DATA WAREHOUSING IN LIBRARIES FOR MANAGING DATABASE

International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16

Time: 3 hours. Full Marks: 70. The figures in the margin indicate full marks. Answers from all the Groups as directed. Group A.

COMP 6838 Data MIning

Adnan YAZICI Computer Engineering Department

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

Data warehouses Decision support The multidimensional model OLAP queries

Machine Learning & Data Mining

Research on Data Mining Technology Based on Business Intelligence. Yang WANG

3 Data, Data Mining. Chengkai Li

Transcription:

Data Mining Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

Why Mine Data? Commercial Viewpoint Lots of data is being collected and warehoused Web data, e-commerce Purchases at department/ grocery stores Bank/Credit Card transactions Computers have become cheaper and more powerful Competitive Pressure is Strong Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint Data collected and stored at enormous speeds (GB/hour) remote sensors on a satellite telescopes scanning the skies microarrays generating gene expression data scientific simulations generating terabytes of data Traditional techniques infeasible for raw data Data mining may help scientists in classifying and segmenting data in Hypothesis Formation

Motivation We are data rich but information poor 4

Data Mining We are buried in data, but looking for knowledge Data mining:knowledge discovery in databases Extraction of interesting knowledge (rules, regularities, patterns) from data in large databases 5

Architecture: Typical Data Mining System Graphical user interface Pattern evaluation Data mining engine Data cleaning data integration Database or data warehouse server Databases Data Warehouse Filtering Knowledge bases

Origins of Data Mining Draws ideas from machine learning/ai, pattern recognition, statistics, and database systems Traditional Techniques may be unsuitable due to Enormity of data High dimensionality of data Heterogeneous, distributed nature of data Statistics/ AI Data Mining Database systems Machine Learning/ Pattern Recognition Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

Course Materials Introduction to data mining Mining association rules Mining sequential patterns Data classification Data clustering Web mining Stream data mining Mining in social network Big data/ cloud data mining 8

Textbook Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison-Wesley (Reference) Data Mining: Concepts and Techniques, J. Han and M. Kamber, Morgan Kaufmann Paper Collection 9

Evaluation of Database Technology 1960s: data collection, database creation 1970s: relational model 1980s: advanced data model 1990s: data mining & data warehousing, digital library, Web databases 2000s: Stream data management and mining, data mining with a variety of applications, Web technology and global information systems 2010s: Big data and cloud data mining 10

Data Mining We are buried in data, but looking for knowledge Data mining:knowledge discovery in databases Extraction of interesting knowledge (rules, regularities, patterns) from data in large databases 11

Why Data Mining? Potential Applications Data analysis and decision support Market analysis and management Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation Risk analysis and management Forecasting, customer retention, competitive analysis Fraud detection and detection of unusual patterns (outliers) 12

Why Data Mining? Potential Applications (cont d) Other Applications Text mining (news group, email, documents) and Web mining Stream data mining DNA and bio-data analysis Mining in social network Multimedia or sensor network 13

Notes Data mining is very application dependent Small team with good skills and domain knowledge Emerging issues: Journals: IEEE TKDE, ACM TKDD, KAIS Conferences: ACM SIGKDD, SIGMOD, CIKM IEEE ICDM, ICDE SDM, VLDB, PAKDD etc. 14

Data Warehousing An architectural foundation for decision support system, consisting of Integrated data, detailed and summarized data, historical data, and metadata Set up stages for effective data mining 15

OLAP On-Line Analytical Processing: simple data mining facility Responds to queries quickly A multidimensional, logical view of the data Interactive analysis of the data: drill down and roll-up, etc. 16

Data Cube Product City 家電 電腦 通訊 Q1 Q2 Q3 Q4 高雄台中新竹台北 Industry Region Year Category Country Quarter Product City Month Week Date Office Day 17

Knowledge Discovery from Databases Nontrivial process of extraction of valid (with some degree of certainty) novel (implicit, previously unknown) potential useful ultimately understandable patterns from large collection of data Pattern expression in languages describing subset of data model (structure) applicable to subset of data 18

Similar Terms of KDD Knowledge Discovery in Databases (KDD) Knowledge mining from databases Knowledge extraction Regularities Data analysis 19

KDD Process Selection & Transformation Data Cleaning & Integration Data Warehouse Data Mining Task Relevant Data Pattern Evaluation Database 20

Classification of DM Techniques What kinds of databases to work on What kind of knowledge to be mined What kind of techniques to be utilized 21

Databases to Work on Relational Transactional Object-oriented Spatial Temporal Multimedia 22

Data Mining Tasks Prediction Methods Use some variables to predict unknown or future values of other variables. Description Methods Find human-interpretable patterns that describe the data. From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

Knowledge to Be Mined Association rules Classification Clustering Trend and deviation analysis Outlier 24

Association Rules Buy(bread) ^ Buy(milk) => Buy(butter) Age(20~29) ^ Income(20~30k) => Buy(CD player) 25

The Classification Problem (informal definition) Katydids Given a collection of annotated data. In this case 5 instances Katydids of and five of Grasshoppers, decide what type of insect the unlabeled example is. Grasshoppers Katydid or Grasshopper? 26

Classification Example Test Set Training Set Learn Classifier Model Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

Classification Supervised classification Organizes data into given classes based on attribute values X<10 Yes No group 1 Y<5 group 2 group 3 28

Classification: Application Fraud Detection Goal: Predict fraudulent cases in credit card transactions. Approach: Use credit card transactions and the information on its account-holder as attributes. When does a customer buy, what does he buy, how often he pays on time, etc Label past transactions as fraud or fair transactions. This forms the class attribute. Learn a model for the class of the transactions. Use this model to detect fraud by observing credit card transactions on an account. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

What is a natural grouping among these objects? Clustering is subjective Simpson's Family School Employees Females Males 30

Clustering Unsupervised classification Organizes data into classes based on attribute values y y x 31 x

Clustering: Application 1 Market Segmentation: Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. Approach: Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

Clustering: Application 2 Document Clustering: Goal: To find groups of documents that are similar to each other based on the important terms appearing in them. Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

Time Series Analysis Trends analysis Regression Sequential patterns Similar sequences 34

Sequential Patterns Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events. (A B) (C) (D E) 35

Time Series Database Time series Financial, marketing & production: stock price, sales number Scientific: weather data, geological, astrophysics Time series DB Databases with many time series of real numbers $price $price 1 365 1 365 36

Time Series Database (cont d) Query in time series DB Searching for similar patterns Whole matching Subsequence matching Examples Identify companies with similar pattern of growth Determine products with similar selling patterns Discover stocks with similar movement in stock prices Find if a musical score is similar to one of the copyrighted scores 37

The similarity matching problem can come in two flavors I Query Q (template) 1: Whole Matching 1 6 2 3 4 7 8 9 C 6 is the best match. 5 10 Database C Given a Query Q, a reference database C and a distance measure, find the C i that best matches Q. 38

The similarity matching problem can come in two flavors II Query Q (template) 2: Subsequence Matching Database C The best matching subsection. Given a Query Q, a reference database C and a distance measure, find the location that best matches Q. Note that we can always convert subsequence matching to whole matching by sliding a window across the long sequence, and copying the window contents. 39

Regression Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Greatly studied in statistics, neural network fields. Examples: Predicting sales amounts of new product based on advetising expenditure. Time series prediction of stock market indices. 40

Deviation/Anomaly Detection Detect significant deviations from normal behavior Applications: Credit Card Fraud Detection Network Intrusion Detection Typical network traffic at University level may reach over 100 million connections per day Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

Performance Measurement Efficiency Effectiveness (interestingness) Objective measures; based on statistics & structures of patterns e.g. support, confidence Subjective: based on user s beliefs in data e.g. unexpectedness, novelty 42

Interestingness A pattern is interesting if it is Easily understood by humans Valid on new or test data with some degree of certainty Potentially useful Validates some hypothesis that a user seeks to confirm 43

Techniques to Be Utilized Database-oriented Machine learning Neural network Fuzzy set Statistics Visualization 44

Features & Challenges of KDD Handling of different types of data Efficiency & scalability of data mining algorithm Usefulness, certainly & expressiveness of results Interactive mining at multiple abstraction levels Parallel & distributed data mining Protection of privacy & data security 45