Introduction to Data Mining and Data Analytics

Similar documents
Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

Knowledge Discovery in Data Bases

Embedded Technosolutions

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Big Data Analytics The Data Mining process. Roger Bohn March. 2016

COMP 465 Special Topics: Data Mining

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

Based on Big Data: Hype or Hallelujah? by Elena Baralis

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Data Mining Course Overview

Jarek Szlichta

SOCIAL MEDIA MINING. Data Mining Essentials

Data Mining Concepts & Tasks

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Data Platforms and Pattern Mining

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

Machine Learning using MapReduce

Data warehouse and Data Mining

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

Knowledge Discovery and Data Mining

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Chapter 3. Foundations of Business Intelligence: Databases and Information Management

Data Mining Concepts

Data Mining: Models and Methods

D B M G Data Base and Data Mining Group of Politecnico di Torino

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

Big Data with Hadoop Ecosystem

data-based banking customer analytics

BIG DATA TESTING: A UNIFIED VIEW

Management Information Systems Review Questions. Chapter 6 Foundations of Business Intelligence: Databases and Information Management

Basic Data Mining Technique

Question Bank. 4) It is the source of information later delivered to data marts.

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)?

ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA

Business Analytics and Big Data: the process and the tools

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Rapid growth of massive datasets

Data Preprocessing. Slides by: Shree Jaswal

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

CS513-Data Mining. Lecture 2: Understanding the Data. Waheed Noor

Data Mining Concepts & Tasks

Big Data Analytics The Data Mining process. Roger Bohn March. 2017

Chapter 1, Introduction

Overview of Big Data

Think & Work like a Data Scientist with SQL 2016 & R DR. SUBRAMANI PARAMASIVAM (MANI)

MATH36032 Problem Solving by Computer. Data Science

Introduction to Text Mining. Hongning Wang

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

COMP90049 Knowledge Technologies

Big Data Its All Around You

Dr. SubraMANI Paramasivam. Think & Work like a Data Scientist with SQL 2016 & R

Basic Concepts Weka Workbench and its terminology

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

9. Conclusions. 9.1 Definition KDD

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Mining Social Media Users Interest

Big Data Specialized Studies

Chapter 3: Supervised Learning

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

Fall 2017 ECEN Special Topics in Data Mining and Analysis

Data Set. What is Data Mining? Data Mining (Big Data Analytics) Illustrative Applications. What is Knowledge Discovery?

Chapter 6 VIDEO CASES

Summary. Machine Learning: Introduction. Marcin Sydow

Challenges for Data Driven Systems

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Data mining fundamentals

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

Chapter 6. Foundations of Business Intelligence: Databases and Information Management VIDEO CASES

Python With Data Science

Machine Learning Algorithms using Parallel Approach. Shri. Aditya Kumar Sinha FDP-Programme Head &Principal Technical officer ACTS, C-DAC Pune

Entities and Attributes. Image 6.2 Image 6.3

Overview. Data Mining for Business Intelligence. Shmueli, Patel & Bruce

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

BIG DATA SCIENTIST Certification. Big Data Scientist

1. Inroduction to Data Mininig

Value Added Association Rules

Preprocessing DWML, /33

Oracle9i Data Mining. Data Sheet August 2002

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

INTRODUCTION... 2 FEATURES OF DARWIN... 4 SPECIAL FEATURES OF DARWIN LATEST FEATURES OF DARWIN STRENGTHS & LIMITATIONS OF DARWIN...

Research Article International Journals of Advanced Research in Computer Science and Software Engineering ISSN: X (Volume-7, Issue-6)

Lecture 18. Business Intelligence and Data Warehousing. 1:M Normalization. M:M Normalization 11/1/2017. Topics Covered

劉介宇 國立台北護理健康大學 護理助產研究所 / 通識教育中心副教授 兼教師發展中心教師評鑑組長 Nov 19, 2012


Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.

Pre-Requisites: CS2510. NU Core Designations: AD

Big Data - Some Words BIG DATA 8/31/2017. Introduction

Web Mining Team 11 Professor Anita Wasilewska CSE 634 : Data Mining Concepts and Techniques

Specialist ICT Learning

Transcription:

1/28/2016 MIST.7060 Data Analytics 1 Introduction to Data Mining and Data Analytics What Are Data Mining and Data Analytics? Data mining is the process of discovering hidden patterns in data, where Patterns refer to inherent relationships and/or dependencies in the data, and Data are typically stored in a structured environment and are large in scale. Data mining is also referred to as knowledge discovery in database (KDD). Data analytics is the process of transforming raw data into knowledge and insight for making better decisions. Related Fields Data mining/analytics is closely related to the fields of database, artificial intelligence, statistics, and information retrieval. But there are considerable differences between data mining and these fields. Database: Focuses on data storage and access technology, while data mining focuses on data analysis and knowledge discovery. Artificial Intelligence (AI): There are overlaps between data mining and AI (including machine learning) techniques. However, AI techniques are not necessarily data-oriented (e.g., expert systems). Statistics: Statistical science assumes data are scarce; it focuses on numeric data and parametric approach (e.g., assume data follows normal distribution). Data mining assumes data are abundant; it deals with various data types and focus on efficient algorithms for large-scale data. Information retrieval (IR) concerns finding materials (e.g., documents) of an unstructured nature (e.g., text) that satisfies an information need; it is closely related to text and web mining. A typical example of IR techniques is a search engine.

1/28/2016 MIST.7060 Data Analytics 2 Terminology Attribute (also called variable or field). Numeric attribute (also called continuous or real attribute): Mathematical operations (e.g., addition, multiplication) can be applied to the values of this type of attribute. Categorical attribute (also called nominal attribute): Mathematical operations cannot be applied to the values of such attribute, even if the values appear in a numeric format (e.g., social security number, credit card number). Record (also called observation or instance). Dataset (relation or relational database table): A set of data with attributes in columns and records in rows. Business Applications Database marketing Credit evaluation Fraud detection Market basket analysis Market segmentation Web usage mining and personalization Data Mining Tasks Supervised learning (where there is a predefined attribute whose values are to be predicted) Classification Prediction (of numeric values) Unsupervised learning (where there is no predefined attribute for prediction) Association (or Association Rules Mining) Clustering (or Cluster Analysis)

1/28/2016 MIST.7060 Data Analytics 3 Classification Classification is the process of assigning data records into one of several predefined groups, called classes. Classification involves building a model (called classifier), which can be a mathematical function, a set of rules, or other representations. Examples of classification: Fraud detection (true or false) Security trading decision (buy, sell, or hold) Medical diagnosis (presence or absence of a disease) Prediction Prediction discovers the relationship between a set of variables, called independent or input variables, and another set of variables, called dependent or output variables in data, so that the past or current values of independent variables can be used to predict the future values of dependent variables. Prediction vs. classification: Prediction: the values of the attribute to be predicted (dependent variable) is numeric. Classification: the values of the attribute to be predicted (class attribute) is categorical. Examples of prediction: Sales volume or revenue prediction Stock price prediction Association Association refers to the presence of one set of items in a group of records implies the existence of another set of items in the same group of records. Examples of association: Market basket analysis (What items were normally bought together in a customer s visit to the store?) Recommender system (Amazon: Customers who bought this item also bought ) Web usage mining (Clickstream analysis. See http://en.wikipedia.org/wiki/clickstream)

1/28/2016 MIST.7060 Data Analytics 4 Clustering Clustering is the process of grouping data records into a number of groups, called clusters, such that records within the same cluster are more similar than those between different clusters. Clustering differs from classification in that groups are formed as a result of the analysis instead of predefined. Examples of clustering: Market segmentation Grouping of library books by field Data Mining Process 1. Problem identification Purpose of the data-mining project and nature of the problem (classification, prediction, clustering, or association?) 2. Data preparation Data collection: retrieving, merging and/or dividing data Data cleaning: correcting errors, handling missing data, resolving inconsistencies Data reduction: sampling (in rows), feature (attribute) selection (in columns) Data transformation: standardizing data, reformatting data, conversion between numeric and categorical data 3. Model formulation and pattern exploration Select appropriate data-mining techniques and tools and use the selected techniques and tools to build models and explore the patterns/relationships hidden in data. 4. Verification and modification Test if the models built are valid; modify the models if necessary. Compare different candidate models. 5. Interpretation and implementation (deployment) Interpret the results of data mining in an intuitive manner. Implement/deploy the model into related applications.

1/28/2016 MIST.7060 Data Analytics 5 Big Data Big data can be characterized by 3Vs: volume, velocity, and variety. Volume: amount and scale of data. Velocity: speed of data in and out. Variety: range and complexity of data types, structures and sources. Examples: Google, Twitter, GPS data, Facebook, YouTube. Financial market: 7 billion shares change hands every day on U.S. markets (MSC p.7). Small Data vs. Bid Data Sampling vs. N = All Causality vs. Correlation Structured (SQL) vs. Unstructured (NoSQL, Not only SQL) MapReduce is a framework for processing large-scale data using parallel and distributed computing technologies with a large number of computers. Apache Hadoop is an open-source implementation of MapReduce. Missing Data Replacement Listwise deletion: disregard a record if it has any missing attribute values. Mean/Mode substitution: For a missing value of a numeric attribute, replace it with the mean of the nonmissing values of that attribute. For a missing value of a categorical attribute, replace it with the mode (most frequent value) of the non-missing values of that attribute. Task-specific missing value replacement methods (we will discuss some of them later in class)

1/28/2016 MIST.7060 Data Analytics 6 Normalizing Numeric Data Transform a series of numeric data into values within range [0, 1], as below: Example: original value min value normalized value = max value min value Original values: 1, 0, 2, and 4. Normalized values: 0, 0.2, 0.6, and 1. Overfitting and Data Partitioning Training set is the portion of data used to build data-mining models. Validation set is the portion of data used to validate or adjust the models, and to prevent overfitting problems. Test set is the data used to evaluate the performance of the models. The test set serves as unseen future data. Overfitting occurs when a model fits the training data very well or even perfectly, but performs poorly when it is applied to unseen future data.