Winter Semester 2009/10 Free University of Bozen, Bolzano

Data Warehousing and Data Mining
Winter Semester 2009/10
Free University of Bozen, Bolzano
DW Lecturer: Johann Gamper (gamper@inf.unibz.it)
DM Lecturer: Mouna Kacimi (mouna.kacimi@unibz.it)
http://www.inf.unibz.it/dis/teaching/dwdm/index.html

Organization
- Lectures: Tuesday & Thursday, from 10:30 to 12:30. Rooms are dynamic.
- Office hours: Dr. Kacimi, Thursday from 14:00 to 17:00 (appointment by email)
- Projects: lab hours Tuesday from 14:00 to 16:00. Room is dynamic.
- Requirements: obtain at least 18 points in each of the following
  - Project: use & implement algorithms, write a project report, present the project
  - Exam: have knowledge about the course, be able to present it
- Textbooks
  - Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, 2006
  - Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Pearson Addison Wesley, 2008, ISBN: 0-32-134136-7
  - Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003

Data Mining

Outline
- Topics: Introduction to Data Mining, Data Analysis and Uncertainty, Classification & Prediction, Cluster Analysis, Applications
- Part I: Introduction & Foundations
- Part II: Supervised Learning
- Part III: Unsupervised Learning
- Part IV: Summary & Open problems

Chapter I: Introduction & Foundations
- 1.1 Introduction
  - 1.1.1 Definitions & Motivations
  - 1.1.2 Data to be Mined
  - 1.1.3 Knowledge to be discovered
  - 1.1.4 Techniques Utilized
  - 1.1.5 Applications Adapted
  - 1.1.6 Major Issues in Data Mining
- 1.2 Getting to Know Your Data
  - 1.2.1 Data Objects and Attribute Types
  - 1.2.2 Descriptive Data Summarization
  - 1.2.3 Measuring Data Similarity and Dissimilarity
- 1.3 Basics from Probability Theory and Statistics
  - 1.3.1 Probability Theory
  - 1.3.2 Statistical Inference: Sampling and Estimation
  - 1.3.3 Statistical Inference: Hypothesis Testing and Regression

1.1 Definitions & Motivations
Why Data Mining?
- Explosive growth of data: from terabytes to petabytes
  - Data collection and data availability: crawlers, database systems, Web, etc.
  - Sources
    - Business: Web, e-commerce, transactions, etc.
    - Science: remote sensing, bioinformatics, etc.
    - Society and everyone: news, YouTube, etc.
- Problem: we are drowning in data, but starving for knowledge!
- Solution: use data mining tools for automated analysis of massive data sets

What is Data Mining?
- Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Data mining: a misnomer?
  - Extracting gold from stone is called gold mining, not stone mining
  - By the same logic, extracting knowledge from data should perhaps be called knowledge mining
- Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Knowledge Discovery (KDD) Process
- View from typical database systems and data warehousing communities
- Data mining plays an essential role in the knowledge discovery process
- Figure (flow from bottom to top): Databases -> Data Cleaning & Integration -> Data Warehouse -> Selection & Transformation -> Task-relevant Data -> Data Mining -> Patterns -> Evaluation & Presentation -> Knowledge

Knowledge Discovery (KDD) Process
- Data cleaning: remove noise and inconsistent data
- Data integration: combine multiple data sources
- Data selection: data relevant to the analysis task are retrieved from the data
- Data transformation: transform data into an appropriate form for mining (summary, aggregation, etc.)
- Data mining: extract data patterns
- Pattern evaluation: identify truly interesting patterns
- Knowledge representation: use visualization and knowledge representation tools to present the mined knowledge to the user
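The steps above can be sketched as a chain of small functions. This is only an illustration, not the course's implementation: the record layout (customer, city, amount), the two sources, and the "pattern" (customers whose total spending exceeds a threshold) are all hypothetical.

```python
# Hypothetical purchase records from two sources (not from the slides).
source_a = [
    {"customer": "anna", "city": "Bolzano", "amount": 700},
    {"customer": "anna", "city": "Bolzano", "amount": 500},
    {"customer": "bob", "city": "Bolzano", "amount": None},  # noisy record
]
source_b = [
    {"customer": "carl", "city": "Paris", "amount": 900},
    {"customer": "bob", "city": "Bolzano", "amount": 300},
]


def clean(records):
    """Data cleaning: drop records whose amount is missing or negative."""
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, city):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in records if r["city"] == city]


def transform(records):
    """Data transformation: aggregate the amount per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals


def mine(totals, threshold):
    """Data mining (toy version): customers whose total exceeds the threshold."""
    return {c: t for c, t in totals.items() if t > threshold}


data = integrate(clean(source_a), clean(source_b))
patterns = mine(transform(select(data, "Bolzano")), threshold=1000)
print(patterns)
```

Pattern evaluation and knowledge representation are left out here; in practice they would filter and visualize what `mine` returns.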

Typical Architecture of a Data Mining System
- Layers (top to bottom): User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; Data cleaning, integration and selection; Database, Data Warehouse, World Wide Web, Other Info Repositories
- Knowledge Base
  - Guides the search
  - Evaluates the interestingness of the results
  - Includes concept hierarchies, user beliefs, constraints, thresholds, metadata, etc.

Confluence of Multiple Disciplines
- Data mining draws on: Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithms, Other Disciplines

Why Confluence of Multiple Disciplines?
- Tremendous amount of data
  - Scalable algorithms to handle terabytes of data (e.g., Flickr had 3.6 billion images in June 2009)
- High dimensionality of data
  - Data can have tens of thousands of features (e.g., DNA microarray)
- High complexity of data
  - Data can be highly complex, can be of different types, and can include different descriptors
  - Images can be described using text and visual features such as color, texture, contours, etc.
  - Videos can be described using text, images and their descriptors, audio phonemes, etc.
  - Social networks can have a complex structure...
- New and sophisticated applications
  - Applications can be difficult (e.g., medical applications)

Different Views of Data Mining
- Data view: kinds of data to be mined
- Knowledge view: kinds of knowledge to be discovered
- Method view: kinds of techniques utilized
- Application view: kinds of applications

1.1.2 Data to be Mined
- In principle, data mining should be applicable to any data repository
- This lecture includes examples about:
  - Relational databases
  - Data warehouses
  - Transactional databases
  - Advanced database systems

Relational Databases
- Database system
  - A collection of interrelated data, known as a database
  - A set of software programs that manage and access the data
- Relational databases (RD)
  - A collection of tables, each with a unique name
  - A table contains a set of attributes (columns) and tuples (rows)
  - Each object in a relational table has a unique key and is described by a set of attribute values
  - Data are accessed using database queries (SQL): projection, join, etc.

  Customers
  cust_id  Name  age  income
  152      Anna  27   24000
  ...      ...   ...  ...

  Purchases
  trans_id  cust_id  method  Amount
  T156      152      Visa    1357
  ...       ...      ...     ...

- Data mining applied to RD
  - Search for trends or data patterns
  - Example: predict the credit risk of customers based on their income, age and expenses
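As a rough illustration of the two tables and the query types named on this slide, the sketch below uses Python's built-in `sqlite3` module. The column types and the in-memory database are assumptions; only the two sample rows come from the slide.

```python
import sqlite3

# In-memory database holding the two example tables (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        cust_id INTEGER PRIMARY KEY,
        name    TEXT,
        age     INTEGER,
        income  INTEGER
    );
    CREATE TABLE purchases (
        trans_id TEXT PRIMARY KEY,
        cust_id  INTEGER REFERENCES customers(cust_id),
        method   TEXT,
        amount   INTEGER
    );
    INSERT INTO customers VALUES (152, 'Anna', 27, 24000);
    INSERT INTO purchases VALUES ('T156', 152, 'Visa', 1357);
    """
)

# Projection: keep only some of the attributes (columns).
projection = conn.execute("SELECT name, income FROM customers").fetchall()

# Join: combine customers with their purchases through the shared key cust_id.
joined = conn.execute(
    """
    SELECT c.name, p.trans_id, p.method, p.amount
    FROM customers AS c
    JOIN purchases AS p ON p.cust_id = c.cust_id
    """
).fetchall()

print(projection)
print(joined)
```

Queries like these answer questions the analyst already knows how to ask; data mining on the same tables looks for trends that were not specified in advance.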

Data Warehouses
- A data warehouse (DW) is a repository of information collected from multiple sources, stored under a unified schema
- Figure: data sources in Bolzano, Paris and Madrid -> clean, integrate, transform, load, refresh -> Data Warehouse -> query and analysis tools -> clients
- Data are organized around major subjects (using summarization)
- Multidimensional database structure (e.g., data cube)
  - Dimension = one attribute or a set of attributes
  - Cell = stores the value of some aggregated measure
- Data mining applied to DW
  - Data warehouse tools help data analysis
  - Data mining tools are required to allow more in-depth and automated analysis
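To make "dimension" and "cell" concrete, here is a minimal sketch of a data cube built from a list of fact rows. The dimensions (city, quarter, product), the sales figures, and the helper name `aggregate` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, quarter, product, sales).
DIMENSIONS = ("city", "quarter", "product")
facts = [
    ("Bolzano", "Q1", "printer", 10),
    ("Bolzano", "Q1", "computer", 20),
    ("Paris", "Q1", "printer", 5),
    ("Paris", "Q2", "computer", 15),
    ("Madrid", "Q2", "printer", 8),
]


def aggregate(rows, keep):
    """Sum the sales measure over every dimension not listed in `keep`.

    Each key of the result identifies one cell; its value is the
    aggregated measure stored in that cell.
    """
    positions = [DIMENSIONS.index(d) for d in keep]
    cells = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        cells[key] += row[-1]
    return dict(cells)


by_city_quarter = aggregate(facts, ("city", "quarter"))
by_city = aggregate(facts, ("city",))  # roll-up: the quarter dimension is dropped

print(by_city_quarter)
print(by_city)
```

Dropping a dimension, as in `by_city`, corresponds to an OLAP roll-up: the same measure is viewed at a coarser level.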

Transactional Databases
- A transactional database (TD) consists of a file where each record represents a transaction
- A transaction includes a unique transaction identifier (trans_id) and a list of the items making up the transaction
- A transactional database may include other tables containing other information regarding the sale (customer_id, location, etc.)

  trans_id  List of item_ids
  T100      I1, I3, I8, I16
  T200      I2, I8
  ...       ...

- Basic analysis (examples)
  - Show me all the items purchased by David Smith?
  - How many transactions include item number 5?
- Data mining on TD
  - Perform a deeper analysis
  - Example: which items sold well together?
  - Basically, data mining systems can identify frequent sets in transactional databases and perform market basket data analysis
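The difference between the basic query and the deeper analysis can be sketched in a few lines. Transactions T100 and T200 are taken from the slide; T300 is a hypothetical extra row added so that at least one item pair occurs more than once.

```python
from collections import Counter
from itertools import combinations

transactions = {
    "T100": {"I1", "I3", "I8", "I16"},
    "T200": {"I2", "I8"},
    "T300": {"I1", "I8"},  # hypothetical, not on the slide
}


def count_transactions_with(item):
    """Basic analysis: how many transactions include the given item?"""
    return sum(1 for items in transactions.values() if item in items)


def pair_counts():
    """Deeper analysis: how often does each pair of items occur together?"""
    counts = Counter()
    for items in transactions.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


n_with_i8 = count_transactions_with("I8")
top_pair = pair_counts().most_common(1)[0]

print(n_with_i8)
print(top_pair)
```

Counting pairs by brute force is only feasible for tiny data; frequent itemset algorithms exist precisely because the number of candidate sets grows very quickly with the number of items.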

Advanced Database Systems (1)
- Advanced database systems provide tools for handling complex data
  - Spatial data (e.g., maps)
  - Engineering design data (e.g., buildings, system components)
  - Hypertext and multimedia data (text, image, audio, and video)
  - Time-related data (e.g., historical records)
  - Stream data (e.g., video surveillance and sensor data)
  - World Wide Web, a huge, widely distributed information repository made available by the Internet
- They require efficient data structures and scalable methods to handle
  - Complex object structures and variable-length records
  - Semi-structured or unstructured data
  - Multimedia and spatiotemporal data
  - Database schemas with complex and dynamic structures

Advanced Database Systems (2)
- Example: World Wide Web
  - Provides rich, worldwide, online and distributed information services
  - Data objects are linked together
  - Users traverse from one object via links to another
- Problems
  - Data can be highly unstructured
  - It is difficult to understand the semantics of web pages and their context
- Data mining on WWW
  - Web usage mining (user access patterns)
    - Improve system design (efficiency)
    - Better marketing decisions (adverts, user profile)
  - Authoritative web page analysis
    - Ranking web pages based on their importance
  - Automated web page clustering and classification
    - Group and arrange web pages based on their content
  - Web community analysis
    - Identify hidden web social networks and observe their evolution
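One well-known way to rank linked pages by importance is PageRank, which these slides name later among the applications. The sketch below is a plain power-iteration version under stated assumptions: the four-page link graph is hypothetical, the damping factor 0.85 is the commonly used default, and every link target is assumed to also appear as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeatedly redistributing rank along links.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share regardless of its in-links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked from A, B and D; nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph the page with the most incoming links (C) ends up with the highest score, and the page nobody links to (D) with the lowest.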

1.1.3 Knowledge to be Discovered
- Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks
- Data mining tasks can be classified into two categories
  - Descriptive: characterize the general properties of the data
  - Predictive: perform inference on the current data to make predictions
- What to extract?
  - Users may not have an idea about what kinds of patterns in their data can be interesting
- What to do?
  - Have a data mining system that can mine multiple types of patterns to handle different user and application needs
  - Discover patterns at various granularities (levels of abstraction). Example of different granularities: street, city, country
  - Allow users to guide the search for interesting patterns

Characterization and Discrimination (1)
- Data can be associated with classes or concepts
- Example of data from a store
  - Classes: printers, computers
  - Concepts: Big-Spenders, Budget-Spenders
- Class/concept descriptions: describe individual classes and concepts in a summarized, concise, and precise way
  - Data characterization: summarize the data of the class under study (target class)
  - Data discrimination: compare the target class with a set of comparative classes (contrasting classes)
  - Data characterization & discrimination: perform both analyses

Characterization and Discrimination (2)
- Data characterization
  - Summarize the general features of a target class of data
  - Tools: statistical measures, data cube-based OLAP roll-up, etc.
  - Output: charts, curves, multidimensional data cubes, etc.
  - Example: summarize the characteristics of customers who spend more than 1000
    - Customer profile: 40-50 years old, employed, excellent credit ratings
- Data discrimination
  - Comparison of the general features of a target class with the general features of contrasting classes
  - Output: similar to characterization + comparative measures
  - Example: compare customers who shop for computer products regularly (more than 2 times a month) with those who rarely shop for such products (less than three times a year)
    - Comparative profile
      - Frequent customers: 80% are between 20 and 40 and have a university education
      - Rare customers: 60% are seniors or youths and have no university degree

Frequent Patterns, Associations, Correlations
- Frequent patterns are patterns occurring frequently in the data (e.g., item-sets, sub-sequences, and substructures)
  - Frequent item-sets: items that frequently appear together
    - Example in a transactional data set: bread and milk
  - Frequent sequential pattern: a frequently occurring subsequence
    - Example in a transactional data set: buy first a PC, second a digital camera, third a memory card
- Association analysis: derive some association rules
  - buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]
  - age(X, "20...29") ∧ income(X, "20K...29K") ⇒ buys(X, "CD player") [support = 2%, confidence = 60%]
- Correlation analysis: uncover interesting statistical correlations between associated attribute-value pairs
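The support and confidence values attached to an association rule can be computed directly from a set of transactions. The sketch below assumes the usual definitions: support is the fraction of transactions containing all items of the rule, and confidence is that support divided by the support of the rule's left-hand side. The four transactions are hypothetical.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
]


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Support of the whole rule divided by the support of its left-hand side."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"computer", "software"})
rule_confidence = confidence({"computer"}, {"software"})

print(f"buys(computer) => buys(software): "
      f"support={rule_support:.0%}, confidence={rule_confidence:.0%}")
```

Here two of the four transactions contain both items, and two of the three computer buyers also bought software.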

Classification and Prediction
- Construct models (functions) based on some training examples
- Describe and distinguish classes or concepts for future prediction
- Predict some unknown class labels

  Training examples
  Age  Income  Class label
  27   28K     Budget-Spenders
  35   36K     Big-Spenders
  65   45K     Budget-Spenders

- Figure: training examples -> supervised learning -> classification model (function); unlabeled data (Age 29, Income 25K) -> classifier -> class label [Budget Spender] or numeric value [Budget Spender (0.8)]
- Typical methods: decision trees, naive Bayesian classification, logistic regression, support vector machines, neural networks, etc.
- Typical applications: credit card fraud detection, classifying web pages, stars, diseases, etc.
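The training table and the unlabeled example from this slide can be run through a very small classifier. The sketch below uses a 1-nearest-neighbor rule, which is not among the methods listed on the slide; it is chosen here only because it fits in a few lines. Income is expressed in thousands, and no feature scaling is applied, which would matter on real data.

```python
import math

# Training examples from the slide: (age, income in thousands) -> class label.
training = [
    ((27, 28), "Budget-Spenders"),
    ((35, 36), "Big-Spenders"),
    ((65, 45), "Budget-Spenders"),
]


def classify(point):
    """Return the label of the training example closest to `point`."""
    nearest = min(training, key=lambda example: math.dist(example[0], point))
    return nearest[1]


# Unlabeled example from the slide: age 29, income 25K.
label = classify((29, 25))
print(label)
```

For this input the nearest training example is the 27-year-old with 28K income, so the predicted label agrees with the class label shown on the slide.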

Cluster Analysis
- Unsupervised learning (i.e., the class label is unknown)
- Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
- Principle: maximizing intra-class similarity & minimizing inter-class similarity
- Typical methods: hierarchical methods, density-based methods, grid-based methods, model-based methods, constraint-based methods, etc.
- Typical applications: WWW, social networks, marketing, biology, library, etc.
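The clustering principle can be checked numerically for a given grouping. The sketch below does not perform clustering itself; it only measures how good a grouping is, using Euclidean distance as a stand-in for dissimilarity. The two point groups are hypothetical.

```python
import math
from itertools import combinations, product

# Hypothetical grouping of 2-D points into two clusters.
clusters = {
    "a": [(1, 1), (1, 2), (2, 1)],
    "b": [(8, 8), (9, 8), (8, 9)],
}


def mean_intra_distance(groups):
    """Average distance between points that share a cluster."""
    distances = [
        math.dist(p, q)
        for points in groups.values()
        for p, q in combinations(points, 2)
    ]
    return sum(distances) / len(distances)


def mean_inter_distance(groups):
    """Average distance between points that lie in different clusters."""
    distances = [
        math.dist(p, q)
        for first, second in combinations(groups.values(), 2)
        for p, q in product(first, second)
    ]
    return sum(distances) / len(distances)


intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(round(intra, 3), round(inter, 3))
```

A good clustering keeps the first number small (high intra-class similarity) and the second one large (low inter-class similarity).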

Outlier Analysis
- Outlier: a data object that does not comply with the general behavior of the data
- Noise or exception? One person's garbage could be another person's treasure
- Typical methods: by-product of clustering or regression analysis, etc.
- Typical applications: useful in fraud detection
- Example: how to uncover fraudulent usage of a credit card?
  - Detect purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account
  - Outliers may also be detected with respect to the location and type of purchase, or the frequency
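The credit card example can be sketched with a simple z-score test: a new charge is flagged when it lies far above the account's usual charges. This is one possible approach, not the method prescribed by the slides; the charge history, the threshold of 3 standard deviations, and the function name are assumptions.

```python
from statistics import mean, stdev


def flag_large_purchases(history, new_amounts, z_threshold=3.0):
    """Return the new amounts that lie far above the account's regular charges."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return []  # no spread in the history, so a z-score is undefined
    return [a for a in new_amounts if (a - mu) / sigma > z_threshold]


# Hypothetical regular charges for one account, followed by two new charges.
history = [20, 35, 25, 40, 30, 22, 38]
flagged = flag_large_purchases(history, [33, 900])
print(flagged)
```

The same idea extends to other attributes mentioned on the slide, such as location, type of purchase, or frequency, each with its own notion of "regular" behavior.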

Evolution Analysis
- Evolution analysis describes trends of data objects whose behavior changes over time
- It includes
  - Characterization and discrimination analysis
  - Association and correlation analysis
  - Classification and prediction
  - Clustering of time-related data
- Distinct features for such analysis
  - Time-series data analysis
  - Sequence or periodicity pattern matching, e.g., first buy a digital camera, then buy large SD memory cards
  - Similarity-based analysis

1.1.4 Techniques Utilized
- Data-intensive
- Data warehouse (OLAP)
- Machine learning
- Statistics
- Pattern recognition
- Visualization
- High-performance...

1.1.5 Applications Adapted
- Web page analysis: from web page classification and clustering to PageRank & HITS algorithms
- Collaborative analysis & recommender systems
- Basket data analysis to targeted marketing
- Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
- From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining

1.1.6 Major Challenges in Data Mining
- Efficiency and scalability of data mining algorithms
- Parallel, distributed, stream, and incremental mining methods
- Handling high dimensionality
- Handling noise, uncertainty, and incompleteness of data
- Incorporation of constraints, expert knowledge, and background knowledge in data mining
- Pattern evaluation and knowledge integration
- Mining diverse and heterogeneous kinds of data: e.g., bioinformatics, Web, software/system engineering, information networks
- Application-oriented and domain-specific data mining
- Invisible data mining (embedded in other functional modules)
- Protection of security, integrity, and privacy in data mining

Summary of Section 1.1
- Data mining is a process of extracting knowledge from data
- Data to be mined can be of any type: relational databases, advanced databases, etc.
- Knowledge to be discovered: frequent patterns, correlations, associations, classification, prediction, clusters
- Techniques to be used: statistics, machine learning, visualization, etc.
- Data mining is interdisciplinary: large amounts of complex data and sophisticated applications
- Challenges of data mining: efficiency, scalability, parallel and distributed mining, handling high dimensionality, handling noisy data, mining heterogeneous data, etc.