DATA MINING II - 1DL460
Spring 2012
A second course in data mining
http://www.it.uu.se/edu/course/homepage/infoutv2/vt12
Kjell Orsborn
Uppsala Database Laboratory
Department of Information Technology, Uppsala University, Uppsala, Sweden
5/18/12
Personnel
Kjell Orsborn, lecturer, examiner
  email: kjell.orsborn@it.uu.se, phone: 471 1154, room: 1321 (house 1, floor 3)
Tore Risch, lecturer
  email: tore.risch@it.uu.se, phone: 471 6342, room: 1353 (house 1, floor 3)
Erik Zeitler, lecturer
  email: erik.zeitler@it.uu.se, phone: 471 3390, room: 1320 (house 1, floor 3)
Andrej Andrejev, course assistant
  email: andrej.andrejev@it.uu.se, phone: 471 7345, room: 1306 (house 1, floor 3)
Lars Melander, course assistant
  email: lars.melander@it.uu.se, phone: 471 1051, room: 1316 (house 1, floor 3)
Preliminary course contents
Lecture topics:
  Course intro - overview of topics in data mining 2
  Web mining
  Search engines
  Sequential association analysis
  Alternative association analysis
  Visual data exploration
  Cluster validation
  Advanced clustering methods: Chameleon, CURE, BIRCH (SNN, ROCK, Jarvis-Patrick)
  Alternative classification techniques: Naïve Bayes, support vector machines, ensemble methods
  Outlier detection
  Stream data mining
  Privacy preserving data mining
Course contents continued
Assignments:
  Assignment 1: Web mining, HITS
  Assignment 2: Implementation of association rule mining
  Assignment 3: Implementation of scalable DBSCAN
  (Possible alternative to Assignment 3: a term paper on various data mining topics)
Examination
Written examination: grades 3, 4 and 5
Assignments: all 3 assignments must be passed
Introduction to Data Mining II
(Tan, Steinbach, Kumar ch. 1)
Kjell Orsborn
Department of Information Technology, Uppsala University, Uppsala, Sweden
Data Mining
The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996).
Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data, in contrast to information and knowledge that are already intuitive.
Patterns and relationships are identified by examining the underlying rules and features in the data.
Tends to work from the data up, and reliable conclusions normally require large volumes of data.
Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing.
A relatively new technology, but already used in a number of industries.
Historic view of data mining
(Han et al., 2006)
The data mining process
  Data cleaning (to remove noise and inconsistent data)
  Data integration (where multiple data sources may be combined)
  Data selection (where data relevant to the analysis task are retrieved from the database)
  Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations)
  Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
  Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
  Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
[Figure: the data mining process. Labels: Databases and files; Cleaning & Integration; Data Warehouse; Selection & Transformation; Data Mining; Patterns; Evaluation & Presentation; Knowledge]
Why data mining?
"There was 5 exabytes of information created between the dawn of civilization through 2003," Schmidt said, "but that much information is now created every 2 days, and the pace is increasing... people aren't ready for the technology revolution that's going to happen to them..."
(Eric Schmidt, Google)
Why data mining?
The explosive growth of data: from terabytes, through petabytes, to exabytes
  Data collection from automated data collection tools, database systems, web, e-commerce, transactions, stocks, remote sensing, bioinformatics, scientific simulation, computerized society, news, digital cameras
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995-1999. From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications]
Why mine data (commercial viewpoint)?
Lots of data is being collected and warehoused
  Web data, e-commerce
  Purchases at department/grocery stores
  Bank and credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
  Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why mine data (scientific viewpoint)?
Data collected and stored at enormous speeds (GB/hour)
  remote sensors on a satellite
  telescopes scanning the skies
  microarrays generating gene expression data
  scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
  in classifying and segmenting data
  in hypothesis formation
Why not traditional data analysis?
Tremendous amount of data
  Algorithms must be highly scalable to handle data volumes such as terabytes
High dimensionality of data
  A microarray may have tens of thousands of dimensions
High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structured data, graphs, social networks and multi-linked data
  Heterogeneous databases and legacy databases
  Spatial, spatiotemporal, multimedia, text and Web data
  Software programs, scientific simulations
New and sophisticated applications
Data mining tasks
Prediction methods
  Use some variables to predict unknown or future values of other variables.
Description methods
  Find human-interpretable patterns that describe the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Classification - definition
Given a collection of records (training set)
  Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
  A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
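The train/test procedure can be sketched as follows. This is a minimal illustration, not the course's method: the toy records, the 30% test fraction, and the choice of a 1-nearest-neighbour model are all assumptions made for the example.

```python
import math
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(training_set, attributes):
    """Assign the class of the closest training record (1-nearest neighbour)."""
    nearest = min(training_set, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(
        1 for attrs, label in test_set
        if predict_1nn(training_set, attrs) == label
    )
    return correct / len(test_set)

# Hypothetical records: (attribute vector, class label).
records = [
    ((1.0, 1.1), "low"), ((0.9, 1.0), "low"), ((1.2, 0.8), "low"),
    ((1.1, 1.3), "low"), ((0.8, 0.9), "low"),
    ((5.0, 5.2), "high"), ((5.1, 4.9), "high"), ((4.8, 5.0), "high"),
    ((5.3, 5.1), "high"), ((4.9, 4.7), "high"),
]

train, test = train_test_split(records)
print(len(train), len(test), accuracy(train, test))
```

The model is built only from `train`; `test` plays the role of the previously unseen records.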
Clustering - definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  Data points in one cluster are more similar to one another.
  Data points in separate clusters are less similar to one another.
Similarity measures:
  Euclidean distance if attributes are continuous.
  Other problem-specific measures.
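A small sketch of the two conditions, using Euclidean distance on continuous attributes. The points and the given cluster assignment are hypothetical; the code only checks a clustering, it does not find one.

```python
import math
from itertools import combinations, product

def mean_intra_cluster_distance(clusters):
    """Average Euclidean distance between pairs of points in the same cluster."""
    dists = [math.dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(dists) / len(dists)

def mean_inter_cluster_distance(clusters):
    """Average Euclidean distance between pairs of points in different clusters."""
    dists = [
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p, q in product(a, b)
    ]
    return sum(dists) / len(dists)

# Hypothetical clustering of six two-dimensional points.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]

intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(intra < inter)
```

A clustering that matches the definition should give a small intra-cluster distance relative to the inter-cluster distance.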
Association rule discovery - definition
Given a set of records, each of which contains some number of items from a given collection,
  produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
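How strongly the five transactions support the two rules can be checked with a short sketch. Support and confidence are the standard rule measures; the slide itself does not name them, so their use here is an assumption, and the function names are illustrative.

```python
# The five transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs), support(lhs, rhs), confidence(lhs, rhs))
```

On this data, {Milk} --> {Coke} holds in 3 of 5 transactions and in 3 of the 4 that contain Milk; {Diaper, Milk} --> {Beer} holds in 2 of 5 transactions and in 2 of the 3 that contain both Diaper and Milk.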
Sequential pattern discovery - definition
Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
  (A B) (C) --> (D E)
Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
  (A B) (C) (D E), with constraints: <= xg (max-gap), > ng (min-gap), <= ws (window size), <= ms (max-span)
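The timing constraints can be illustrated with a simplified check. The sketch assumes that xg, ng and ms denote max-gap, min-gap and max-span as in Tan, Steinbach, Kumar, and that each pattern element occurs at a single timestamp, so the window size ws is left out. The function name and the example times are hypothetical.

```python
def satisfies_timing_constraints(timestamps, max_gap, min_gap, max_span):
    """Check one occurrence of a sequential pattern against timing constraints.

    timestamps: occurrence times of the consecutive pattern elements, in order.
    Each gap between consecutive elements must be > min_gap and <= max_gap,
    and the time from the first to the last element must be <= max_span.
    """
    if len(timestamps) < 2:
        return True  # a single element has no gaps and zero span
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    gaps_ok = all(min_gap < gap <= max_gap for gap in gaps)
    span_ok = timestamps[-1] - timestamps[0] <= max_span
    return gaps_ok and span_ok

# Hypothetical occurrence times for the elements (A B), (C), (D E).
print(satisfies_timing_constraints([1, 3, 6], max_gap=3, min_gap=1, max_span=5))
print(satisfies_timing_constraints([1, 2, 6], max_gap=3, min_gap=1, max_span=5))
```

The first occurrence passes every constraint; the second fails because its first gap is not greater than the min-gap (and its second gap exceeds the max-gap).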
Deviation or anomaly detection
Detect significant deviations from normal behavior
Applications:
  Credit card fraud detection
  Network intrusion detection
Typical network traffic at university level may reach over 100 million connections per day
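One simple way to flag a significant deviation is a z-score test. This is a sketch of one basic technique, not a method prescribed by the slide: the threshold of 2 standard deviations and the data values are assumptions.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical measurements, e.g. connections per minute from one host.
observations = [10, 11, 9, 10, 12, 10, 11, 95]
print(zscore_outliers(observations))
```

Note that a single extreme value inflates both the mean and the standard deviation, which is why more robust methods are usually preferred on real data.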
Challenges of data mining
  Scalability
  Dimensionality
  Complex and heterogeneous data
  Data quality
  Data ownership and distribution
  Privacy preservation
  Streaming data