Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining
Martin Ester
Simon Fraser University
School of Computing Science
Graduate Course, Spring 2006
CMPT 843, SFU, Martin Ester, 1-06

Introduction

[Fayyad, Piatetsky-Shapiro & Smyth 96]
Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is valid, previously unknown, and potentially useful.

Remarks
  (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction is necessary.
  valid: in the statistical sense.
  previously unknown: not explicit, no common-sense knowledge.
  potentially useful: for some given application.

Introduction

Statistics [Hand, Mannila & Smyth 2001]
  representation of uncertainty
  model-based inferences
  focus on numeric data

Machine Learning [Mitchell 1997]
  knowledge representation
  search strategies
  focus on symbolic data

Database Systems [Han & Kamber 2000]
  data management
  integration of data mining with DBS
  scalability for large databases

Introduction

KDD Process [Han & Kamber 2000]
  [Figure, stages in order: Databases, Data Cleaning / Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation, Knowledge]

KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
  [Figure, stages in order: Database, Focussing, Preprocessing, Transformation, Data Mining, Pattern, Evaluation, Knowledge]

Data Mining

Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]
Data Mining is the application of efficient algorithms to determine the patterns contained in some database.

Data-Mining Tasks
  clustering
  classification
  association rules (e.g., A and B -> C)
  generalisation
  other tasks: regression, outlier detection, ...
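
To make the association-rule task concrete, the following minimal sketch scores a rule of the form "A and B -> C" on a toy transaction set. Support and confidence are the standard measures for such rules; the slides do not name them, and the function names and the data are illustrative assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    ante = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante if ante > 0 else 0.0


# Hypothetical toy data: each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"B", "C"},
    {"A", "C"},
]

rule_support = support(transactions, {"A", "B", "C"})
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})
print(rule_support, rule_confidence)
```

Here three of five transactions contain both A and B, and two of those also contain C, so the rule holds in two thirds of the cases where its antecedent applies.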

Trends in KDD Research

KDD 2000 Conference
  New Data Mining Algorithms
  Efficiency and Scalability of Data Mining Algorithms
  Interactive Data Exploration
  Visualization
  Constraints and Evaluation in the KDD Process

Trends in KDD Research

KDD 2002 Conference
  Statistical Methods
  Frequent Patterns
  Streams and Time Series
  Visualization
  Web Search and Navigation
  Text and Web Page Classification
  Intrusion and Privacy
  Applications

Trends in KDD Research

KDD 2004 Conference
  Frequent Patterns / Association Rules
  Clustering
  Mining Spatio-Temporal Data
  Mining Data Streams
  Dimensionality Reduction
  Privacy-Preserving Data Mining
  Mining Biological Data
  Applications (Web, biological data, security, ...)

Trends in KDD Research

KDD 2005 Conference
  Clustering
  Privacy
  Mining Spatio-Temporal Data
  Mining Data Streams
  SVMs
  Text and Web Mining
  Mining (Social) Networks
  Graph Mining (best paper on graphs over time)

Trends in KDD Research

Increasing Importance
  Mining data streams
  Clustering high-dimensional data
  Mining spatio-temporal data
  Privacy-preserving data mining
  Network analysis
  Graph mining
  Multi-relational data mining

Overview of this Course

Prerequisites
  Basics in database systems and statistics
  Introductory graduate data mining course

Objectives
  Introduction to some hot topics of data mining research
  Introduction to some ongoing research projects of our DDM Lab
  General research methodology
  Presentation skills
  Start thesis work after this class!

Overview of this Course

Topics
  Clustering high-dimensional data
  Mining data streams
  Spatio-temporal data mining
  Multi-relational data mining
  Graph mining

Overview of this Course

Format
  Tutorial surveys
  Research paper presentations (and discussions)
  Small research projects

Grading
  Paper presentation
  Project presentation
  Project report
  originality, technical quality, presentation quality

Clustering High-Dimensional Data

Applications

Biological Data
  Micro-Array Data: rows = genes, columns = conditions / experiments, value measures the expression level of a gene under a given condition
  Often: thousands of columns
  Co-regulated genes: similar expression levels in a subset of all conditions

Text / Web Data
  Text / web document: attributes = term frequencies
  Typically, >> 1000 relevant terms
  Document clusters: document sets that share some important terms

Clustering High-Dimensional Data

Curse of Dimensionality
  The more dimensions, the larger the (average) pairwise distances
  Clusters only in lower-dimensional subspaces
  [Figure: clusters only in the 1-dimensional subspace salary]
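
The growth of average pairwise distances can be checked empirically. The sketch below assumes points drawn uniformly from the unit hypercube and Euclidean distance; both choices, and the function name, are illustrative and not taken from the slides.

```python
import math
import random


def mean_pairwise_distance(n_points, n_dims, seed=0):
    """Average Euclidean distance over all pairs of uniformly random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    pairs = 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs


# The average distance grows with the number of dimensions.
for dims in (1, 2, 10, 100):
    print(dims, round(mean_pairwise_distance(50, dims), 3))
```

As the distances grow, points become nearly equidistant, which is one reason full-dimensional clusters become hard to find.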

Clustering High-Dimensional Data

Approaches

Approach 1: find interesting subspaces, then clusters within these subspaces
  Cluster: dense connected region in data space
  Drawbacks: density threshold hard to determine (should be different); clusters highly overlapping

Approach 2: start with full-dimensional clustering and iteratively refine the clusters and relevant cluster dimensions
  Drawbacks: result ill-defined; number of clusters / cluster dimensions hard to determine
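
A heavily simplified sketch of the first approach, restricted to 1-dimensional subspaces: each attribute is cut into equal-width bins, bins holding at least a threshold number of points count as dense, and runs of adjacent dense bins form clusters. The grid, the threshold, the function name, and the toy data are assumptions for illustration, not the method of any particular paper.

```python
def dense_intervals(values, n_bins, min_count):
    """Return runs of adjacent dense bins as (first_bin, last_bin) pairs.

    Values are assumed to lie in [0, 1).
    """
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1

    clusters = []
    current = []
    for bin_index, count in enumerate(counts):
        if count >= min_count:
            current.append(bin_index)
        elif current:
            clusters.append((current[0], current[-1]))
            current = []
    if current:
        clusters.append((current[0], current[-1]))
    return clusters


# Hypothetical data: 100 objects, age spread evenly, salary in two tight groups.
age = [(i + 0.5) / 100 for i in range(100)]
salary = [(0.25 if i < 50 else 0.75) + (i % 10) * 0.001 for i in range(100)]

print("age clusters:", dense_intervals(age, n_bins=10, min_count=25))
print("salary clusters:", dense_intervals(salary, n_bins=10, min_count=25))
```

With this data only the salary subspace contains dense regions. The single fixed `min_count` also shows the difficulty noted above: one threshold has to serve every subspace.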

Mining Data Streams

Applications

Telecommunications
  o Telecommunications providers collect call records (from, to, when, how long, ...)
  o Want to use the data not only for billing, but also for analysis (monitor trends in usage, customer segmentation, campaign design, ...)

Sensor networks
  o Network of distributed sensors measuring several parameters such as precipitation, temperature, amount of traffic, blood pressure, ...
  o Data need to be monitored and analyzed on-line (immediate response)

Mining Data Streams

Challenges

Characteristics of data streams
  o Massive volumes of data
  o Records arrive at a rapid rate

Requirements
  o Main memory too small to store all records
  o Each record is examined at most once
  o Real-time response, i.e. very efficient processing

Mining Data Streams

Approach
  [Figure: Data Stream 1 ... Data Stream m feed a Stream Processing Engine, which keeps a Synopsis in Main Memory and returns an (Approximate) Answer]

  Summarize using samples, histograms or novel methods such as CF-trees
  How to maximize the approximation accuracy?
  How to exploit the temporal dimension (aging of data)?
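
One common way to keep a sample as the synopsis is reservoir sampling, sketched below. It looks at each record once and stores at most k records, which matches the stream requirements; the choice of this technique and the function name are assumptions, since the slides only mention samples in general.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k records from a stream seen once."""
    rng = random.Random(seed)
    reservoir = []
    for n, record in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the n-th record with probability k / n,
            # replacing a uniformly chosen slot.
            slot = rng.randrange(n)
            if slot < k:
                reservoir[slot] = record
    return reservoir


# Hypothetical stream of 10,000 call-record ids, summarized by 5 of them.
sample = reservoir_sample(iter(range(10_000)), k=5, seed=42)
print(sample)
```

This sketch treats all records alike. Handling the aging of data would need a variant that favors recent records.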

Spatio-Temporal Data Mining

Applications

Geo-marketing
  Purchasing patterns for particular geographical areas (e.g., for choice of store location)

Health care data analysis
  Analysis of the spread of diseases
  Interventions by Public Health Authorities

Data referencing the earth's surface (spatial) and the time (temporal)

Spatio-Temporal Data Mining

Challenges

Independence assumption no longer valid
  Attribute values of neighboring objects are typically correlated

Operations on spatial data are very expensive
  Spatial objects are complex (lines, polygons, 3D surfaces, ...), which makes the corresponding operations very expensive

Temporal dimension
  Blows up the pattern search space
  What patterns do we really want to find in a spatio-temporal DB?

Spatio-Temporal Data Mining

Approaches

Consider spatial auto-correlation
  Find only patterns that deviate from what is expected according to spatial auto-correlation

Efficient support by the DBMS
  Indexes, basic operations, ...

Models for spatio-temporal data mining
  Definition of new pattern types such as spatio-temporal trends
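
Spatial auto-correlation can be quantified before searching for deviating patterns. The sketch below uses Moran's I, a standard statistic that the slides do not name, on a hypothetical chain of five regions in which each region neighbors the next; the weight matrix and the values are illustrative assumptions.

```python
def morans_i(values, weights):
    """Moran's I for values at n locations and an n x n neighbor weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


def chain_weights(n):
    """Weight 1 between consecutive locations on a line, 0 otherwise."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


weights = chain_weights(5)
smooth = morans_i([1, 2, 3, 4, 5], weights)       # neighbors are similar
alternating = morans_i([1, 5, 1, 5, 1], weights)  # neighbors are dissimilar
print(smooth, alternating)
```

A positive value indicates that neighboring objects have similar attribute values, a negative value that they differ.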

Multi-Relational Data Mining

Applications

Mining biological data
  o Molecular biologists collect data on genes, proteins, gene expression, metabolic pathways, ...
  o Want to learn, e.g., about the process of gene regulation

Text mining
  o Using information extraction methods, entities (companies, persons, genes, ...) and their relationships (directs, married, regulates, ...) can be extracted from a text document
  o Can be used as input for true text mining: finding knowledge rather than documents

Multi-Relational Data Mining

Limitations of Existing Methods

Emerging applications are inherently multi-relational
  o Input: multiple tables (entity sets) and their relationships
  o Record characteristics: own attributes, related records from other tables and the attributes of these related records

Existing data mining methods are single-relational
  o Input: a single table (relation), Output: refers to attributes of a single table
  o Data representation as a universal relation (single table) is possible, but may lose a lot of information
  propositional logic
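
A small sketch of how flattening into a universal relation can lose or distort information. The two tables, the join, and all names are hypothetical: customers with many orders are repeated in the joined table, and customers without orders disappear from it.

```python
# Hypothetical entity tables.
customers = {"c1": {"age": 30}, "c2": {"age": 60}, "c3": {"age": 90}}
orders = [("c1", 10.0), ("c1", 25.0), ("c1", 40.0), ("c2", 15.0)]


def universal_relation(customers, orders):
    """Inner join of orders with customers: one row per order."""
    return [
        {"customer": cid, "age": customers[cid]["age"], "amount": amount}
        for cid, amount in orders
    ]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


flat = universal_relation(customers, orders)
mean_age_true = mean(c["age"] for c in customers.values())
mean_age_flat = mean(row["age"] for row in flat)
print(mean_age_true, mean_age_flat)
```

Any single-table method run on `flat` sees c1 three times and never sees c3, so statistics over customer attributes no longer describe the customer population.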

Multi-Relational Data Mining

Approaches

Inductive Logic Programming
  o Logic program: facts (records) and deduction rules (background knowledge)
  o Task: find (first order) logic rules with some target predicate in the conclusion
  o Restrict search space by user-specified (syntactic) constraints
  Drawbacks: huge search space; syntactic constraints are hard to define; only for classification tasks

Multi-Relational Data Mining

Approaches

First-order versions of standard data mining algorithms
  o Multi-relational decision trees
  o Multi-relational association rules
  What rule format / semantics (in particular, aggregation operations)?

Multi-relational distances
  o Family of distance functions with different depths, taking into account attributes of related records up to the given depth
  o Standard methods can be applied, e.g. k-means or k-nn classification
  (global) distance function loses a lot of information
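
One possible reading of a depth-parameterized distance, as a sketch only: records carry numeric attributes plus a list of related records, depth 0 compares own attributes, and each further level adds an averaged nearest-neighbor distance between the sets of related records. The record layout, the set distance, and the penalty for a missing side are all assumptions made for illustration.

```python
def attr_distance(a, b):
    """Sum of absolute differences over the numeric attributes both records share."""
    shared = a["attrs"].keys() & b["attrs"].keys()
    return sum(abs(a["attrs"][k] - b["attrs"][k]) for k in shared)


def set_distance(xs, ys, depth):
    """Symmetric average of nearest-neighbor distances between two record sets."""
    if not xs and not ys:
        return 0.0
    if not xs or not ys:
        return 1.0  # assumed fixed penalty when only one side has related records
    forward = sum(min(rel_distance(x, y, depth) for y in ys) for x in xs) / len(xs)
    backward = sum(min(rel_distance(x, y, depth) for x in xs) for y in ys) / len(ys)
    return (forward + backward) / 2


def rel_distance(a, b, depth):
    """Distance that follows relationships up to the given depth."""
    d = attr_distance(a, b)
    if depth > 0:
        d += set_distance(a["related"], b["related"], depth - 1)
    return d


# Hypothetical customers with equal age but different orders.
alice = {"attrs": {"age": 40}, "related": [{"attrs": {"amount": 10}, "related": []}]}
bob = {"attrs": {"age": 40}, "related": [{"attrs": {"amount": 50}, "related": []}]}

print(rel_distance(alice, bob, depth=0), rel_distance(alice, bob, depth=1))
```

At depth 0 the two customers are indistinguishable; at depth 1 their orders separate them. Collapsing all related records into one number is also where such a global distance discards detail.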

Graph Mining

Applications

Analysis of the internet
  o What are the most important web pages?
  o What will the internet / web look like next year?

Social network analysis
  o What customers should be targeted to maximize the profit of a marketing campaign?
  o Whom to immunize in order to stop the spread of some virus?
  o Find abnormal subgraphs (e.g., criminal rings).
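
One well-known answer to the question of page importance is PageRank, sketched here as a plain power iteration. The slides do not name an algorithm; the damping factor, iteration count, function name, and the four-page link graph are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to.

    Every link target must itself be a key of links.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A page without outgoing links spreads its rank over all pages.
            targets = links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical web graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page c collects links from three pages and ends up with the highest score, while d, which nobody links to, ends up with the lowest.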

Graph Mining

Challenges

Definition of new types of patterns
  o Certain subgraphs ...
  o Which ones are interesting in a given application?

Complexity
  o Many graph problems are NP-complete.
  o Real graphs tend to be extremely large.
  Need efficient algorithms

Dynamics
  o Many networks evolve rapidly.

References

Text Books
  Han J., Kamber M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.
  Hand D., Mannila H., Smyth P., Principles of Data Mining, MIT Press, 2001.
  Mitchell T. M., Machine Learning, McGraw-Hill, 1997.