DATA MINING II - 1DL460

Similar documents
DATA MINING II - 1DL460

Knowledge Discovery and Data Mining

INTRODUCTION TO DATA MINING

Statistical Learning and Data Mining CS 363D/ SSC 358

An Introduction to Data Mining BY:GAGAN DEEP KAUSHAL

D B M G Data Base and Data Mining Group of Politecnico di Torino

Data mining fundamentals

Foundation of Data Mining: Introduction

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

Data Mining Course Overview

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Data Mining: Introduction. Lecture Notes for Chapter 1. Introduction to Data Mining

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Stats Overview Ji Zhu, Michigan Statistics 1. Overview. Ji Zhu 445C West Hall

Data Mining Concept. References. Why Mine Data? Commercial Viewpoint. Why Mine Data? Scientific Viewpoint

COMP90049 Knowledge Technologies

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

COMP 465 Special Topics: Data Mining

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

Introduction to Data Mining CS 584 Data Mining (Fall 2016)

Introduction to Data Mining

Introduction to Data Mining. Komate AMPHAWAN

CISC 4631 Data Mining Lecture 01:

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Includes Review of Syllabus OVERVIEW OF THE CLASS

Winter Semester 2009/10 Free University of Bozen, Bolzano

Question Bank. 4) It is the source of information later delivered to data marts.

DATABASE DESIGN I - 1DL300

Knowledge Discovery & Data Mining

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

ISM 50 - Business Information Systems

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Chapter 1, Introduction

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

COMP 6838 Data MIning

DATA MINING LECTURE 1. Introduction

1. Inroduction to Data Mininig

CSE-4412: Data Mining

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

DATA MINING II - 1DL460. Spring 2014"

Data Mining & Machine Learning

Data Mining Concepts & Techniques

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Knowledge Discovery in Data Bases

Data Mining and Warehousing

Data Mining and Data Warehousing Introduction to Data Mining

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

CSE5243 INTRO. TO DATA MINING

Big Data Analytics The Data Mining process. Roger Bohn March. 2016

Chapter 3: Data Mining:

Data Mining & Feature Selection

INTRODUCTION TO DATA MINING ASSOCIATION RULES. Luiza Antonie

Chapter 4 Data Mining A Short Introduction

DATA MINING - 1DL105, 1DL111

BIG DATA TESTING: A UNIFIED VIEW

CS249: ADVANCED DATA MINING

DATA MINING II - 1DL460

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer

DATA MINING LECTURE 1. Introduction

DATA MINING LECTURE 1. Introduction

Code No: R Set No. 1

9. Conclusions. 9.1 Definition KDD

Data Mining & Data Warehouse

Dynamic Data in terms of Data Mining Streams

DATABASTEKNIK - 1DL116

DATA WAREHOUSING IN LIBRARIES FOR MANAGING DATABASE

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification

Overview of Web Mining Techniques and its Application towards Web

Data warehouse and Data Mining

Chapter 1 Introduction to Data Mining

International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16

Market baskets Frequent itemsets FP growth. Data mining. Frequent itemset Association&decision rule mining. University of Szeged.

Contents. Preface to the Second Edition

745: Advanced Database Systems

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

CT75 DATA WAREHOUSING AND DATA MINING DEC 2015

DATA MINING INTRO LECTURE. Introduction

Data Mining: Data. What is Data? Lecture Notes for Chapter 2. Introduction to Data Mining. Properties of Attribute Values. Types of Attributes

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

A SURVEY OF DATA MINING & ITS APPLICATIONS

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Database Infrastructure to Support Knowledge Management in Physicochemical Data - Application in NIST/TRC SOURCE Data System

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV

Data warehousing and Phases used in Internet Mining Jitender Ahlawat 1, Joni Birla 2, Mohit Yadav 3

Data Mining: Dynamic Past and Promising Future

An Indian Journal FULL PAPER. Trade Science Inc. Research on data mining clustering algorithm in cloud computing environments ABSTRACT KEYWORDS

DATA MINING TRANSACTION

An Improved Apriori Algorithm for Association Rules

Data Stream Mining. Tore Risch Dept. of information technology Uppsala University Sweden

Incremental Frequent Pattern Mining. Abstract

A Brief Introduction to Data Mining

Jarek Szlichta

Chapter 6. Foundations of Business Intelligence: Databases and Information Management VIDEO CASES

Transcription:

DATA MINING II - 1DL460 Spring 2012 A second course in data mining!! http://www.it.uu.se/edu/course/homepage/infoutv2/vt12 Kjell Orsborn! Uppsala Database Laboratory! Department of Information Technology, Uppsala University,! Uppsala, Sweden 5/18/12 1

Kjell Orsborn, lecturer, examiner Personell email: kjell.orsborn@it.uu.se, phone: 471 1154, room: 1321 (house 1, floor 3)! Tore Risch, lecturer email: tore.risch@it.uu.se, phone 471 6342, room 1353 (house 1, floor 3)! Erik Zeitler, lecturer email: erik.zeitler@it.uu.se, phone 471 3390, room 1320 (house 1, floor 3)! Andrej Andrejev, course assistant, email: andrej.andrejev@it.uu.se phone: 471 7345, room 1306 (house 1, floor 3)! Lars Melander, course assistant email: lars.melander.it.uu.se, phone: 471 1051, room 1316 (house 1, floor 3) 5/18/12 2

Preliminary course contents Lecture topics:! Course intro - overview of topics in data mining 2 Web mining Search engines Sequential association analysis Alt. association analysis Visual data exploration Cluster validation Advanced clustering methods: Chamelon, Cure Birch (SNN, Rock, Jarvis-Patrick) Alternative classification techniques: Naïve Bayes Support vector machines Ensemble methods Outlier detection Stream data mining Privacy preserving data mining 5/18/12 3

Course contents continued Assignments:! Assignment 1 Web mining HITS! Assignment 2 Implementation of Association Rule Mining! Assignment 3 Implementation of scalable DBSCAN (Possible alternative to A3 term paper on various dm topics) 5/18/12 4

Examination Written examination grade 3, 4 and 5 Assignments all 3 assignments should be passed with a passing grade 5/18/12 5

Introduction to Data Mining II (Tan, Steinbach, Kumar ch. 1) Kjell Orsborn! Department of Information Technology Uppsala University, Uppsala, Sweden 5/18/12 6

Data Mining The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions, (Simoudis, 1996). Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data; in contrast to information and knowledge that are already intuitive.! Patterns and relationships are identified by examining the underlying rules and features in the data.! Tends to work from the data up and most accurate results normally require large volumes of data to deliver reliable conclusions.! Data mining can provide huge paybacks for companies who have made a significant investment in data warehousing.! Relatively new technology, however already used in a number of industries. 5/18/12 7

Historic view of data mining Han et al, 2006. 5/18/12 8

The data mining process Data cleaning (to remove noise and inconsistent data) Data integration (where multiple data sources may be combined) Data selection (where data relevant to the analysis task are retrieved from the database) Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations) Data mining (an essential process where intelligent! methods are applied in order to extract data patterns) Pattern evaluation (to identify the truly! interesting patterns representing! knowledge based on some! interestingness measures) Knowledge presentation (where! visualization and knowledge! representation techniques! are used to present the! mined knowledge! to the user) Cleaning & Integration! Selection & Transformation! Data Warehouse Data Mining! Evaluation & Presentation! 1 Knowledge! 6 5 3 2 Patterns Database Database Database File File File 5/18/12 9

Why data mining? There was 5 exabytes of information created between the dawn of civilization through 2003, Schmidt said, but that much information is now created every 2 days, and the pace is increasing...people aren't ready for the technology revolution that's going to happen to them...! (Eric Schmidt, Google)! 5/18/12 10

Why data mining? The explosive growth of data: from terabytes, through petabytes, to exabytes Data collection from automated data collection tools, database systems, web, e-commerce, transactions, stocks, remote sensing, bioinformatics, scientific simulation, computerized society, news, digital cameras, Human analysts may take weeks to discover useful information Much 4,000,000 of the data is never analyzed at all Total new disk (TB) since 1995 3,500,000 3,000,000 2,500,000 The Data Gap! From: R. Grossman, C. Kamath, 2,000,000 1,500,000 V. Kumar, Data Mining for Scientific and Engineering Applications 1,000,000 500,000 Number of analysts 0 1995 1996 1997 1998 1999 5/18/12 11

Why mine data (commercial viewpoint)? Lots of data is being collected! and warehoused Web data, e-commerce purchases at department/! grocery stores Bank & credit card! transactions! Computers have become cheaper and more powerful! Competitive pressure is strong Provide better, customized services for an edge (e.g. in Customer Relationship Management) 5/18/12 12

Why mine data (scientific viewpoint)? Data collected and stored at! enormous speeds (GB/hour) remote sensors on a satellite telescopes scanning the skies microarrays generating gene! expression data scientific simulations! generating terabytes of data! Traditional techniques infeasible for raw data! Data mining may help scientists in classifying and segmenting data in hypothesis formation 5/18/12 13

Why not traditional data analysis? Tremendous amount of data Algorithms must be highly scalable to handle such as tera-bytes of data High-dimensionality of data Micro-array may have tens of thousands of dimensions High complexity of data Data streams and sensor data Time-series data, temporal data, sequence data Structure data, graphs, social networks and multi-linked data Heterogeneous databases and legacy databases Spatial, spatiotemporal, multimedia, text and Web data Software programs, scientific simulations New and sophisticated applications 5/18/12 14

Data mining tasks Prediction methods Use some variables to predict unknown or future values of other variables. Description methods Find human-interpretable patterns that describe the data. From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996 5/18/12 15

Classification - definition Given a collection of records (training set) Each record contains a set of attributes, one of the attributes is the class. Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. 5/18/12 16

Clustering - definition Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another. Similarity Measures: Euclidean distance if attributes are continuous. Other problem-specific measures. 5/18/12 17

Association rule discovery - definition Given a set of records each of which contain some number of items from a given collection; Produce dependency rules which will predict occurrence of an item based on occurrences of other items. TID Items 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Coke, Diaper, Milk Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer} 5/18/12 18

Sequential pattern discovery definition Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.!!! (A B) (C) (D E) Rules are formed by first disovering patterns. Event occurrences in the patterns are governed by timing constraints. (A B) (C) (D E) <= xg >ng <= ws <= ms 5/18/12 19

Deviation or anomaly detection Detect significant deviations from normal behavior Applications: Credit Card Fraud Detection Network Intrusion Detection Typical network traffic at University level may reach over 100 million connections per day 5/18/12 20

Challenges of data mining Scalability Dimensionality Complex and heterogeneous data Data quality Data ownership and distribution Privacy preservation Streaming data 5/18/12 21