Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining
Martin Ester
Simon Fraser University, School of Computing Science
Graduate Course, Spring 2006
CMPT 843, SFU, Martin Ester, 1-06

Introduction
[Fayyad, Piatetsky-Shapiro & Smyth 96]
Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases, where the knowledge is valid, previously unknown, and potentially useful.
Remarks
o (semi-)automatic: in contrast to manual analysis / OLAP. Typically, some user interaction is necessary.
o valid: in the statistical sense.
o previously unknown: not explicit, not common-sense knowledge.
o potentially useful: for some given application.

Introduction
Statistics [Hand, Mannila & Smyth 2001]
o representation of uncertainty
o model-based inferences
o focus on numeric data
Machine Learning [Mitchell 1997]
o knowledge representation
o search strategies
o focus on symbolic data
Database Systems [Han & Kamber 2000]
o data management
o integration of data mining with DBS
o scalability for large databases

Introduction
KDD Process [Han & Kamber 2000]
Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge
KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
Database -> Focussing -> Preprocessing -> Transformation -> Data Mining -> Pattern -> Evaluation -> Knowledge

Data Mining
Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]
Data mining is the application of efficient algorithms to determine the patterns contained in some database.
Data-Mining Tasks
o clustering
o classification
o association rules (e.g., A and B -> C)
o generalisation
o other tasks: regression, outlier detection, ...

Trends in KDD Research
KDD 2000 Conference
o New Data Mining Algorithms
o Efficiency and Scalability of Data Mining Algorithms
o Interactive Data Exploration
o Visualization
o Constraints and Evaluation in the KDD Process

Trends in KDD Research
KDD 2002 Conference
o Statistical Methods
o Frequent Patterns
o Streams and Time Series
o Visualization
o Web Search and Navigation
o Text and Web Page Classification
o Intrusion and Privacy
o Applications

Trends in KDD Research
KDD 2004 Conference
o Frequent Patterns / Association Rules
o Clustering
o Mining Spatio-Temporal Data
o Mining Data Streams
o Dimensionality Reduction
o Privacy-Preserving Data Mining
o Mining Biological Data
o Applications (Web, biological data, security, ...)

Trends in KDD Research
KDD 2005 Conference
o Clustering
o Privacy
o Mining Spatio-Temporal Data
o Mining Data Streams
o SVMs
o Text and Web Mining
o Mining (Social) Networks
o Graph Mining (best paper on graphs over time)

Trends in KDD Research
Increasing Importance
o Mining data streams
o Clustering high-dimensional data
o Mining spatio-temporal data
o Privacy-preserving data mining
o Network analysis
o Graph mining
o Multi-relational data mining

Overview of this Course
Prerequisites
o Basics in database systems and statistics
o Introductory graduate data mining course
Objectives
o Introduction to some hot topics of data mining research
o Introduction to some ongoing research projects of our DDM Lab
o General research methodology
o Presentation skills
-> start thesis work after this class!

Overview of this Course
Topics
o Clustering high-dimensional data
o Mining data streams
o Spatio-temporal data mining
o Multi-relational data mining
o Graph mining

Overview of this Course
Format
o Tutorial surveys
o Research paper presentations (and discussions)
o Small research projects
Grading
o Paper presentation
o Project presentation
o Project report
Criteria: originality, technical quality, presentation quality

Clustering High-Dimensional Data
Applications
Biological Data
o Micro-array data: rows = genes, columns = conditions / experiments; each value measures the expression level of a gene under a given condition
o Often: thousands of columns
o Co-regulated genes: similar expression levels in a subset of all conditions
Text / Web Data
o Text / web document: attributes = term frequencies
o Typically, >> 1000 relevant terms
o Document clusters: document sets that share some important terms

Clustering High-Dimensional Data
Curse of Dimensionality
o The more dimensions, the larger the (average) pairwise distances
o Clusters exist only in lower-dimensional subspaces
Example (figure): clusters only in the 1-dimensional subspace "salary"
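The growth of pairwise distances can be checked empirically. The following is a minimal sketch, assuming points drawn uniformly from the unit hypercube and Euclidean distance; the function name pairwise_distance_stats and the chosen sample sizes are illustrative and not part of the course material.

```python
import math
import random


def pairwise_distance_stats(num_points, num_dims, seed=0):
    """Return (mean, min, max) Euclidean distance over all pairs of
    points sampled uniformly from the unit hypercube [0, 1]^num_dims."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(num_dims)] for _ in range(num_points)]
    distances = [
        math.dist(points[i], points[j])
        for i in range(num_points)
        for j in range(i + 1, num_points)
    ]
    return sum(distances) / len(distances), min(distances), max(distances)


if __name__ == "__main__":
    for dims in (1, 2, 10, 100, 1000):
        mean, lo, hi = pairwise_distance_stats(num_points=100, num_dims=dims)
        # The relative spread (max - min) / mean shrinks as dims grows:
        # all points become roughly equally far from each other.
        print(f"d={dims:5d}  mean={mean:7.3f}  (max-min)/mean={(hi - lo) / mean:6.3f}")
```

Under these assumptions the mean distance grows with the number of dimensions while the relative spread between the nearest and the farthest pair shrinks, which is why full-dimensional distances become a poor basis for clustering.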

Clustering High-Dimensional Data
Approaches
Approach 1
o Cluster: dense connected region in data space
o Find interesting subspaces, then clusters within these subspaces
o Drawbacks: density threshold hard to determine (should differ across subspaces); clusters highly overlapping
Approach 2
o Start with a full-dimensional clustering and iteratively refine the clusters and the relevant cluster dimensions
o Drawbacks: result ill-defined; number of clusters / cluster dimensions hard to determine

Mining Data Streams
Applications
Telecommunications
o Providers collect call records (from, to, when, how long, ...)
o Want to use the data not only for billing, but also for analysis (monitor trends in usage, customer segmentation, campaign design, ...)
Sensor networks
o Network of distributed sensors measuring several parameters such as precipitation, temperature, amount of traffic, blood pressure, ...
o Data need to be monitored and analyzed on-line (immediate response)

Mining Data Streams
Challenges
Characteristics of data streams
o Massive volumes of data
o Records arrive at a rapid rate
Requirements
o Main memory is too small to store all records
o Each record is examined at most once
o Real-time response, i.e. very efficient processing

Mining Data Streams
Approach
Data Stream 1 ... Data Stream m -> Stream Processing Engine (synopsis in main memory) -> (approximate) answer
o Summarize using samples, histograms or novel methods such as CF-trees
o How to maximize the approximation accuracy?
o How to exploit the temporal dimension (aging of data)?
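One way to maintain a sample-based synopsis in a single pass is reservoir sampling. The sketch below is a standard textbook technique (often called Algorithm R), not a method taken from the slides; the function name reservoir_sample and its parameters are illustrative assumptions.

```python
import random


def reservoir_sample(stream, capacity, seed=None):
    """Keep a uniform random sample of at most `capacity` records from a
    stream of unknown length, looking at each record exactly once."""
    rng = random.Random(seed)
    reservoir = []
    for index, record in enumerate(stream):
        if index < capacity:
            # Fill phase: the first `capacity` records are always kept.
            reservoir.append(record)
        else:
            # Replace phase: record number index+1 is kept with
            # probability capacity / (index + 1).
            slot = rng.randint(0, index)  # both ends inclusive
            if slot < capacity:
                reservoir[slot] = record
    return reservoir


if __name__ == "__main__":
    # Hypothetical stream: call durations in seconds, generated lazily.
    stream = (duration % 600 for duration in range(1_000_000))
    sample = reservoir_sample(stream, capacity=1000, seed=7)
    print("approximate mean call duration:", sum(sample) / len(sample))
```

The synopsis uses memory proportional to capacity regardless of stream length, which matches the requirement that each record is examined at most once. It treats all records equally, so it does not by itself address the aging of data.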

Spatio-Temporal Data Mining
Applications
Geo-marketing
o Purchasing patterns for particular geographical areas (e.g., for choice of store location)
Health care data analysis
o Analysis of the spread of diseases
o Interventions by Public Health Authorities
Data referencing the earth surface (spatial) and the time (temporal)

Spatio-Temporal Data Mining
Challenges
Independence assumption no longer valid
o Attribute values of neighboring objects are typically correlated
Operations on spatial data are very expensive
o Spatial objects are complex (lines, polygons, 3D surfaces, ...), which makes the corresponding operations very expensive
Temporal dimension
o Blows up the pattern search space
o What patterns do we really want to find in spatio-temporal DBs?

Spatio-Temporal Data Mining
Approaches
Consider spatial auto-correlation
o Find only patterns that deviate from what is expected according to spatial auto-correlation
Efficient support by the DBMS
o Indexes, basic operations, ...
Models for spatio-temporal data mining
o Definition of new pattern types such as spatio-temporal trends
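Spatial auto-correlation can be quantified before deciding which patterns are unexpected. A common statistic for this is Moran's I; the slides do not name a specific measure, so its use here is an assumption. The sketch computes it on a small grid with binary edge-sharing neighborhood weights; the helper names grid_neighbors and morans_i are illustrative.

```python
def grid_neighbors(rows, cols):
    """Map each grid cell (r, c) to the cells that share an edge with it."""
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            neighbors[(r, c)] = [
                (nr, nc) for nr, nc in candidates if 0 <= nr < rows and 0 <= nc < cols
            ]
    return neighbors


def morans_i(values, neighbors):
    """Moran's I with binary weights: w_ij = 1 if j is a neighbor of i, else 0."""
    n = len(values)
    mean = sum(values.values()) / n
    deviation = {cell: value - mean for cell, value in values.items()}
    total_weight = sum(len(adjacent) for adjacent in neighbors.values())
    cross_products = sum(
        deviation[cell] * deviation[other]
        for cell, adjacent in neighbors.items()
        for other in adjacent
    )
    variance_sum = sum(d * d for d in deviation.values())
    return (n / total_weight) * (cross_products / variance_sum)


rows, cols = 4, 4
adjacency = grid_neighbors(rows, cols)
# Smooth west-to-east gradient: neighboring cells have similar values.
gradient = {(r, c): float(c) for r in range(rows) for c in range(cols)}
# Checkerboard: every cell differs from all of its neighbors.
checkerboard = {(r, c): float((r + c) % 2) for r in range(rows) for c in range(cols)}

if __name__ == "__main__":
    print("gradient:", morans_i(gradient, adjacency))
    print("checkerboard:", morans_i(checkerboard, adjacency))
```

The gradient yields a positive value (similar neighbors) and the checkerboard yields -1 (dissimilar neighbors). A value near zero would be consistent with the independence assumption that standard methods rely on.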

Multi-Relational Data Mining
Applications
Mining biological data
o Molecular biologists collect data on genes, proteins, gene expression, metabolic pathways, ...
o Want to learn, e.g., about the process of gene regulation
Text mining
o Using information extraction methods, entities (companies, persons, genes, ...) and their relationships (directs, married, regulates, ...) can be extracted from a text document
o Can be used as input for true text mining: finding knowledge rather than documents

Multi-Relational Data Mining
Limitations of Existing Methods
Emerging applications are inherently multi-relational
o Input: multiple tables (entity sets) and their relationships
o Record characteristics: own attributes, related records from other tables and the attributes of these related records
Existing data mining methods are single-relational (propositional logic)
o Input: a single table (relation); output: refers to attributes of a single table
o Data representation as a universal relation (single table) is possible, but may lose a lot of information
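A small example shows one way a universal relation can distort the data. This is a sketch with hypothetical customers and orders tables; the table contents and the helper names universal_relation and mean_age are assumptions made for illustration.

```python
# Hypothetical entity table: one row per customer.
customers = [
    {"id": 1, "age": 30},
    {"id": 2, "age": 60},
]
# Hypothetical related table: customer 1 has three orders, customer 2 has one.
orders = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 1, "amount": 20},
    {"customer_id": 1, "amount": 30},
    {"customer_id": 2, "amount": 40},
]


def universal_relation(customers, orders):
    """Join both tables into a single flat table (one row per order)."""
    return [
        {**customer, "amount": order["amount"]}
        for customer in customers
        for order in orders
        if order["customer_id"] == customer["id"]
    ]


def mean_age(rows):
    return sum(row["age"] for row in rows) / len(rows)


flat = universal_relation(customers, orders)

if __name__ == "__main__":
    print("rows in flat table:", len(flat))             # 4 rows for 2 customers
    print("mean age over customers:", mean_age(customers))  # 45.0
    print("mean age over flat table:", mean_age(flat))      # 37.5
```

After the join, a customer is no longer a single record: customer 1 appears three times, so any single-table method that treats rows as independent objects over-weights customers with many orders.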

Multi-Relational Data Mining
Approaches
Inductive Logic Programming
o Logic program: facts (records) and deduction rules (background knowledge)
o Task: find (first-order) logic rules with some target predicate in the conclusion
o Restrict the search space by user-specified (syntactic) constraints
o Drawbacks: huge search space; syntactic constraints are hard to define; only for classification tasks

Multi-Relational Data Mining
Approaches
First-order versions of standard data mining algorithms
o Multi-relational decision trees
o Multi-relational association rules
o Open question: what rule format / semantics (in particular, aggregation operations)?
Multi-relational distances
o Family of distance functions with different depths, taking into account attributes of related records up to the given depth
o Standard methods can be applied, e.g. k-means or k-nn classification
o Drawback: a (global) distance function loses a lot of information

Graph Mining
Applications
Analysis of the internet
o What are the most important web pages?
o What will the internet / web look like next year?
Social network analysis
o Which customers should be targeted to maximize the profit of a marketing campaign?
o Whom to immunize in order to stop the spread of some virus?
o Find abnormal subgraphs (e.g., criminal rings).
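One well-known way to rank the importance of web pages is PageRank; the slides do not name a method, so it serves here only as an example. The following is a minimal power-iteration sketch, assuming a damping factor of 0.85 and a small hypothetical link graph; the function name pagerank and its parameters are illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Pages without outgoing links spread their rank evenly over all pages.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling_mass = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling_mass / n for node in nodes
        }
        for source, targets in links.items():
            for target in targets:
                # Each page splits its current rank evenly among its out-links.
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical four-page web: a, b and d all link to c; c links back to a.
    web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(web).items(), key=lambda item: -item[1]):
        print(page, round(score, 3))
```

In this toy graph the page with the most incoming links receives the highest score. On real web graphs the same iteration has to run over extremely large, sparse link structures, which is where efficiency becomes the main concern.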

Graph Mining
Challenges
Definition of new types of patterns
o Certain subgraphs, ...
o Which ones are interesting in a given application?
Complexity
o Many graph algorithms are NP-complete.
o Real graphs tend to be extremely large.
o Need efficient algorithms.
Dynamics
o Many networks evolve rapidly.

References
Text Books
o Han J., Kamber M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.
o Hand D., Mannila H., Smyth P., Principles of Data Mining, MIT Press, 2001.
o Mitchell T. M., Machine Learning, McGraw-Hill, 1997.