Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Similar documents
Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Introduction to Data Mining. Lijun Zhang

SCHEME OF COURSE WORK. Data Warehousing and Data mining

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

Chapter 1, Introduction

Data Mining Course Overview

Table Of Contents: xix Foreword to Second Edition

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

R07. FirstRanker. 7. a) What is text mining? Describe about basic measures for text retrieval. b) Briefly describe document cluster analysis.

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Introduction to Data Mining and Data Analytics

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Visualization and text mining of patent and non-patent data

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Data warehouse and Data Mining

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV


D B M G Data Base and Data Mining Group of Politecnico di Torino

Analyzing Outlier Detection Techniques with Hybrid Method

Detection and Deletion of Outliers from Large Datasets

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Performance Analysis of Data Mining Classification Techniques

Dynamic Data in terms of Data Mining Streams

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Fall Principles of Knowledge Discovery in Databases. University of Alberta

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Data Preprocessing. Slides by: Shree Jaswal

Comparative Study of Clustering Algorithms using R

Chapter 3: Data Mining:

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

Data mining fundamentals

CSE5243 INTRO. TO DATA MINING

Overview of Web Mining Techniques and its Application towards Web

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1394

MetaData for Database Mining

Ubiquitous Computing and Communication Journal (ISSN )

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

Question Bank. 4) It is the source of information later delivered to data marts.

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

Machine Learning - Clustering. CS102 Fall 2017

Anomaly Detection on Data Streams with High Dimensional Data Environment

Research on Data Mining Technology Based on Business Intelligence. Yang WANG

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

DOI:: /ijarcsse/V7I1/0111

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

9. Conclusions. 9.1 Definition KDD

1. Inroduction to Data Mininig

UNIT 2 Data Preprocessing

Data Mining and Analytics. Introduction

Winter Semester 2009/10 Free University of Bozen, Bolzano

CS513-Data Mining. Lecture 2: Understanding the Data. Waheed Noor

COMP 465 Special Topics: Data Mining

A Program demonstrating Gini Index Classification

Contents. Preface to the Second Edition

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

DATA MINING AND WAREHOUSING

DATA WAREHOUING UNIT I

Dynamic Clustering of Data with Modified K-Means Algorithm

Pre-Requisites: CS2510. NU Core Designations: AD

Basic Data Mining Technique

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

Chapter 4 Data Mining A Short Introduction

IMPROVING THE PERFORMANCE OF OUTLIER DETECTION METHODS FOR CATEGORICAL DATA BY USING WEIGHTING FUNCTION

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

2. Data Preprocessing

Introduction to Data Science

Knowledge Modelling and Management. Part B (9)

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Review on Data Mining Techniques for Intrusion Detection System

An Improved Apriori Algorithm for Association Rules

K236: Basis of Data Science

SK International Journal of Multidisciplinary Research Hub Research Article / Survey Paper / Case Study Published By: SK Publisher

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Data Mining. Jeff M. Phillips. January 12, 2015 CS 5140 / CS 6140

Semi-Supervised Clustering with Partial Background Information

COURSE PLAN. Computer Science & Engineering

UNIT 2. DATA PREPROCESSING AND ASSOCIATION RULES

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

DECISION TREE ANALYSIS ON J48 AND RANDOM FOREST ALGORITHM FOR DATA MINING USING BREAST CANCER MICROARRAY DATASET

Knowledge Discovery and Data Mining 1 (VO) ( )

ECT7110. Data Preprocessing. Prof. Wai Lam. ECT7110 Data Preprocessing 1

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

CT75 (ALCCS) DATA WAREHOUSING AND DATA MINING JUN

Keywords: clustering algorithms, unsupervised learning, cluster validity

CSE 626: Data mining. Instructor: Sargur N. Srihari. Phone: , ext. 113

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

Tutorial Case studies

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples.

IJSER. Privacy and Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Transcription:

Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20

Table of contents 1 Introduction 2 Data mining process 3 Major problems in data mining 4 Outline of course Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 2 / 20

Introduction Data mining is the study of collecting, cleaning, processing, analysing, and gaining useful insight from data. Data mining is a broad term that is used to describe different aspects of data processing (depending on the problem domain, applications, formulation, and data representations). Data mining is needed because virtually all automated systems generate some form of data either for diagnostic or analysis purpose. Examples of different kinds of data World wide web (web logs, web graph,...). Financial interactions. User interactions. Sensors and internet of things. We hope that we can extract concise and possibly actionable insight from available data for application specific goals. Data may be arbitrary, unstructured, and even in a format that is not immedietly sutable for automated processing. To address the above issues, data mining analysts use a pipeline of processing, where the new data is collected, cleaned, and transformed into standadized format. Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 3 / 20

Data mining process The workflow of a data mining application contains the following phases. Data collection After the collection phase, the data is often stored in a databas (more generally a data warehouse). Feature extraction and data cleaning When the data is collected, it is often not in a form that is suitable for processing. For processing, it is transformed into suitable format such as multidimensional, time-series, or semi-structured. The multidimensional format is the most common one, in which different fields of the correspond to the different measured properties called as features, attributes, or dimensions. The feature extraction phase performed in parallel with data cleaning, where missing and erroneous parts of the data are either estimated or corrected. The data may be extracted from multiple sources and integrated into a unified format for processing. Analytical processing and algorithms The final part of the mining process is to design effective analytical methods from the processed data. Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 4 / 20

Data preprocessing phase The data preprocessing phase is perhaps the most crucial one in the data mining process. This phase begins after the collection of the data, and it consists of the following steps: Feature extraction Extracting the right features is often a skill that requires an understanding of the specific application domain at hand. Data cleaning The extracted data may have erroneous or missing entries. Therefore, some records may need to be dropped, or missing entries may need to be estimated. Inconsistencies may need to be removed. Feature selection and transformation When the data is very high dimensional, many data mining algorithms do not work effectively. Furthermore, many of the high- dimensional features are noisy and may add errors to the data mining process. Therefore, a variety of methods are used to either remove irrelevant features or transform the current set of features to a new data space that is more amenable for analysis. The data cleaning process requires statistical methods that are commonly used for missing data estimation. Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 5 / 20

Basic data types One interesting aspect of the data mining process is the wide variety of data types that are available for analysis. There are two broad types of data for the data mining process: Non-dependency oriented data This typically refers to simple data types such as multidimensional data or text data. These data types are the simplest and most commonly encountered. Dependency oriented data In these cases, implicit or explicit relationships may exist between data items. For example, a social network data set contains a set of vertices (data items) that are connected together by a set of edges (relationships). If the dependencies between data items are not explicitly specified but are known to typically exist in that domain, the dependencies are called Implicit dependencies such as temperatures measured from a sensor. The explicit dependencies refers to graph or network data in which edges are used to specify explicit relationships. Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 6 / 20

Non-dependency oriented data This is the simplest form of data and typically refers to multidimensional data. This data typically contains a set of records. A record is also referred to as a data point, instance, example, transaction, entity, tuple, object, or feature-vector, depending on the application at hand. Each record contains a set of fields, which are also referred to as attributes, dimensions, and features. These fields describe the different properties of that record. Definition (Multidimensional Data) A multidimensional data set D is a set of n records, X 1, X 2,..., X n, such that each record X i contains a set of d features denoted by (x i1, x i2,..., x id ). Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 7 / 20

Non-dependency oriented data The following data set has two different data types. The age field has values that are numerical in the sense that they have a natural ordering. Such attributes are referred to as continuous, numeric or quantitative. Data in which all fields are quantitative is also referred to as quantitative data or numeric data. The attributes such as gender, race, and ZIP code, have discrete values without a natural ordering among them. This type of data is categorical, then such data is referred to as unordered discrete-valued or categorical. In the case of mixed attribute data, there is a combination of categorical and numeric attributes. Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 8 / 20

Major problems in data mining Four problems in data mining are considered fundamental to the mining process. These problems correspond to clustering, classification, association pattern mining, and outlier detection, and they are encountered repeatedly in the context of many data mining applications. To answer these questions, one must understand the nature of the typical relationships that data scientists often try to extract from the data. Consider a multidimensional database D with n records, and d attributes. Such a database D may be represented as an n d matrix D, in which each row corresponds to one record and each column corresponds to a dimension. We generally refer to this matrix as the data matrix. Broadly speaking, data mining is all about finding summary relationships between the entries in the data matrix that are either unusually frequent or unusually infrequent. Relationships between data items are one of two kinds: When considering relationships between columns, the frequent or infrequent relationships between the values in a particular row are determined. This maps into either the positive or negative association pattern mining problem. In some cases, one column is considered more important. This problem is referred to as data classification. When considering relationships between rows, the goal is to determine subsets of rows, in which values in columns are related. In cases where these subsets are similar, the problem is referred to as clustering. When entries in a row are very different from the entries in other rows, then that row becomes interesting as an anomaly. This problem is referred to as outlier analysis. Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 9 / 20

Associative pattern mining In its most primitive form, the association pattern mining problem is defined in the context of sparse binary databases. Most customer transaction databases are of this type. A particularly commonly studied version of this problem is the frequent pattern mining problem or, more generally, the association pattern mining problem. Definition (Frequent pattern mining) Given a binary n d data matrix D, determine all subsets of columns such that all the values in these columns take on the value of 1 for at least a fraction s of the rows in the matrix. The relative frequency of a pattern is referred to as its support. Patterns that satisfy the minimum support requirement are often referred to as frequent patterns. Frequent patterns represent an important class of association patterns. Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 10 / 20

Data clustering A broad and informal definition of the clustering problem is as follows: Definition (Data Clustering) Given a data matrix D, partition its rows (records) into sets C 1, C 2,..., C k, such that the rows (records) in each cluster are similar to one another. An important part of the clustering process is the design of an appropriate similarity function for the computation process. Some examples of relevant applications are as follows: Customer segmentation Data summarization Application to other data mining problems Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 11 / 20

Outlier Detection An outlier is a data point that is significantly different from the remaining data. Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature. Definition (Outlier Detection) Given a data matrix D, determine the rows of the data matrix that are very different from the remaining rows in the matrix. Some examples of relevant applications of outlier detection are as follows: Intrusion-detection systems Credit card fraud Interesting sensor events Medical diagnosis Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 12 / 20

Data classification Many data mining problems are directed toward a specialized goal that is sometimes represented by the value of a particular feature in the data. This particular feature is referred to as the class label. Definition (Data Classification) Given an n d training data matrix D, and a class label value in {1,..., k} associated with each of the n rows in D, create a training model M, which can be used to predict the class label of a d-dimensional record X / D. Some examples of applications where the classification problem is used are as follows: Target marketing Features about customers are related to their buying behavior with the use of a training model. Intrusion detection The sequences of customer activity in a computer system may be used to predict the possibility of intrusions. Supervised anomaly detection The rare class may be differentiated from the normal class when previous examples of outliers are available. Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 13 / 20

Outline of course 1 Introduction 2 Data collection & preprocessing Data cleaning Data warehouse Feature extraction & selection 3 Frequent pattern mining Frequent itemset mining Summarizing itemsets Graph & sequence mining Evaluating patterns and association rules 4 Data clustering Representation-based & Hierarchical clustering Density-based clustering Spectral & graph clustering Clustering validation 5 Data classification Probablistic classifiers Decision tree classifiers Linear discriminant analysis & support vector machines Classification assessment 6 Outlier detection 7 Advanced topic & applications Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 14 / 20

References Charu C. Aggarwal, Data Mining, Springer, 2015. J. Han, M. Kamber, and Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012. M. J. Zaki and W. M. JR, Data Mining and Analysis : Fundamental Concepts and Algorithms, Cambridge University Press, 2014. Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 15 / 20

Some relevant journals 1 IEEE Trans on Pattern Analysis and Machine Intelligence 2 Journal of Machine Learning Research 3 Pattern Recognition 4 Machine Learning 5 Neural Networks 6 Neural Computation 7 Neurocomputing 8 IEEE Trans. on Neural Networks and Learning Systems 9 Annuals of Statistics 10 Journal of the American Statistical Association 11 Pattern Recognition Letters 12 Artificial Intelligence 13 Data Mining and Knowledge Discovery 14 IEEE Transaction on Cybernetics (SMC-B) 15 IEEE Transaction on Knowledge and Data Engineering 16 Knowledge and Information Systems Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 16 / 20

Some relevant conferences 1 Neural Information Processing Systems (NIPS) 2 International Conference on Machine Learning (ICML) 3 European Conference on Machine Learning (ECML) 4 Asian Conference on Machine Learning (ACML2013) 5 Conference on Uncertainty in Artificial Intelligence (UAI) 6 Practice of Knowledge Discovery in Databases (PKDD) 7 International Joint Conference on Artificial Intelligence (IJCAI) 8 IEEE International Conference on Data Mining series (ICDM) Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 17 / 20

Relevant packages and datasets 1 Packages: R http://www.r-project.org/ Weka http://www.cs.waikato.ac.nz/ml/weka/ RapidMiner http://rapidminer.com/ MOA http://moa.cs.waikato.ac.nz/ 2 Datasets: UCI Machine Learning Repository http://archive.ics.uci.edu/ml/ StatLib http://lib.stat.cmu.edu/datasets/ Delve http://www.cs.toronto.edu/~delve/data/datasets.html Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 18 / 20

Course evaluation Evaluation: First Mid-term 2 1394/8/14 Second Mid-term 3 1394/9/5 Final exam 6 Sum of all exams 6.5 for passing Quiz 3 Presentation 1 Homeworks 5 Using R in all homeworks 0.5 Scribe notes 0.5 Course page: http://ce.sharif.edu/courses/94-95/1/ce714-1/ TAs : Alireza Baraani Zohre Fallahnejad baraani@ce.sharif.edu z.fallahnejad@gmail.com Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 19 / 20

Papers for seminars Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 20 / 20