Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

Similar documents
Introduction to Machine Learning. Duen Horng (Polo) Chau Associate Director, MS Analytics Associate Professor, CSE, College of Computing Georgia Tech

Data Mining Concepts & Tasks

Data Mining Concepts & Tasks

CSE 6242 / CX 4242 Course Review. Duen Horng (Polo) Chau Associate Director, MS Analytics Assistant Professor, CSE, College of Computing Georgia Tech

Text Analytics (Text Mining)

Analytics Building Blocks

Text Analytics (Text Mining)

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

D B M G Data Base and Data Mining Group of Politecnico di Torino

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Data Collection, Simple Storage (SQLite) & Cleaning

Introduction to Data Mining and Data Analytics

Classification Key Concepts

Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Data Integration. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Graphs III. CSE 6242 A / CS 4803 DVA Feb 26, Duen Horng (Polo) Chau Georgia Tech. (Interactive) Applications

Graphs / Networks CSE 6242/ CX Interactive applications. Duen Horng (Polo) Chau Georgia Tech

Jarek Szlichta

COMP 465 Special Topics: Data Mining

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

COMP90049 Knowledge Technologies

Data Mining and Data Warehousing Introduction to Data Mining

Knowledge Discovery and Data Mining

Visualization and text mining of patent and non-patent data

Chapter 1, Introduction

Data Mining Course Overview

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Data mining fundamentals

Summary. Machine Learning: Introduction. Marcin Sydow

Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig

Graphs / Networks CSE 6242/ CX Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Fall 2017 ECEN Special Topics in Data Mining and Analysis

Stats Overview Ji Zhu, Michigan Statistics 1. Overview. Ji Zhu 445C West Hall

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

Text Analytics (Text Mining)

Text Analytics (Text Mining)

DATA MINING II - 1DL460

Knowledge Discovery in Data Bases

Machine Learning - Clustering. CS102 Fall 2017

Data Mining Algorithms

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Machine Learning using MapReduce

Data Mining. Jeff M. Phillips. January 7, 2019 CS 5140 / CS 6140

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Exploratory Analysis: Clustering

Chapter 3. Foundations of Business Intelligence: Databases and Information Management

DATA MINING II - 1DL460

Data Mining Clustering

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Classification Key Concepts

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

Introduction to Data Mining

Data Science Tutorial

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

Chapter 3: Data Mining:

Machine Learning with Python

Data warehouse and Data Mining

Data Mining: Models and Methods

Introduction. Welcome. Machine Learning

Winter Semester 2009/10 Free University of Bozen, Bolzano

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

Outline. Project Update Data Mining: Answers without Queries. Principles of Information and Database Management 198:336 Week 12 Apr 25 Matthew Stone

3 Data, Data Mining. Chengkai Li

Data Mining Concepts

1. Inroduction to Data Mininig

Data Mining and Analytics. Introduction

CISC 4631 Data Mining Lecture 01:

Security analytics: From data to action Visual and analytical approaches to detecting modern adversaries

Taking Your Application Design to the Next Level with Data Mining

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Demystifying Machine Learning

Knowledge Discovery and Data Mining 1 (VO) ( )

Big Data Analytics The Data Mining process. Roger Bohn March. 2016

An Effectual Approach to Swelling the Selling Methodology in Market Basket Analysis using FP Growth

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Mining Social Media Users Interest

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

An Introduction to Data Mining BY:GAGAN DEEP KAUSHAL

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

What is Unsupervised Learning?

Data Preprocessing. Slides by: Shree Jaswal

DATA MINING INTRO LECTURE. Introduction

Question Bank. 4) It is the source of information later delivered to data marts.

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

Data Preprocessing UE 141 Spring 2013

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

Chapter 4 Data Mining A Short Introduction

Transcription:

http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Data Mining Concepts Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

http://www.amazon.com/data-science-businessdata-analytic-thinking/dp/1449361323

1. Classification (or Probability Estimation) Predict which of a (small) set of classes an entity belong to. 3

(From previous class) 1. Classification (or Probability Estimation) Predict which of a (small) set of classes an entity belong to. email spam (y, n) sentiment analysis (+, -, neutral) news (politics, sports, ) medical diagnosis (cancer or not) shirt size (s, m, l) face/cat detection face detection (baby, middle-aged, etc) buy /not buy - commerce fraud detection census: gender 4

(From previous class) 1. Classification (or Probability Estimation) Predict which of a (small) set of classes an entity belong to. Cancer testing (yes, no) Movie genre (action, drama, etc.) sports (win, loss) email spam filter (spam, or not) gesture detection (pinch, swipe ) planet zone habitable or not gene prediction news types (sports, entertainment) virus scanning (malware or not) 5

2. Regression ( value estimation ) Predict the numerical value of some variable for an entity. 6

(From previous class) 2. Regression ( value estimation ) Predict the numerical value of some variable for an entity. stock value real estate wine valuation food/commodity sports betting movie ratings forex any product sales energy spread of diseases (regression + geographical info) web/commerce traffic GPA - how much time putting in?? 7

(From previous class) 2. Regression ( value estimation ) Predict the numerical value of some variable for an entity. point value of wine (50-100) credit score (start with classification; default or not) stock prices wall street relationship between price and sales weather sports and game scores 8

3. Similarity Matching Find similar entities (from a large dataset) based on what we know about them. 9

(From previous class) 3. Similarity Matching Find similar entities (from a large dataset) based on what we know about them. dating recommender system (movies, items) customers in marketing price comparison (consumer, find similar priced) comparing emails if they re spam or not? facebook/trend/friends suggestions finding employees similar youtube videos (e.g., more cat videos) similar web pages (find near duplicates or representative sites) ~= clustering plagiarism detection 10

(From previous class) 3. Similarity Matching Find similar entities (from a large dataset) based on what we know about them. recommending items you may want to buy find similar gene sequences (that may be repeating, or does similar things) online dating building auditing (energy consumption) patent search carpool matching (find people to carpool) detecting fake identities 11

4. Clustering (unsupervised learning) Group entities together by their similarity. (User provides # of clusters) 12

(From previous class) 4. Clustering (unsupervised learning) Group entities together by their similarity. (User provides # of clusters) groupings of similar bugs in code optical character recognition unknown vocabulary topical analysis (tweets?) land cover: tree/road/ for advertising: grouping users for marketing purposes fireflies clustering speaker recognition (multiple people in same room) astronomical clustering 13

(From previous class) 4. Clustering (unsupervised learning) Group entities together by their similarity. (User provides # of clusters) cluster people into demographics groups (young, old, etc) cluster people by accents (y all, you all) hierarchical clustering for metabolomics clustering images on the web (cat?) ~=dimensionality reduction 14

5. Co-occurrence grouping (Many names: frequent itemset mining, association rule discovery, market-basket analysis) Find associations between entities based on transactions that involve them (e.g., bread and milk often bought together) http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girlwas-pregnant-before-her-father-did/ 15

6. Profiling / Pattern Mining / Anomaly Detection (unsupervised) Characterize typical behaviors of an entity (person, computer router, etc.) so you can find trends and outliers. Examples? computer instruction prediction removing noise from experiment (data cleaning) detect anomalies in network traffic moneyball weather anomalies (e.g., big storm) google sign-in (alert) smart security camera embezzlement trending articles 16

7. Link Prediction / Recommendation Predict if two entities should be connected, and how strongly that link should be. linkedin/facebook: people you may know amazon/netflix: because you like terminator suggest other movies you may also like 17

8. Data reduction ( dimensionality reduction ) Shrink a large dataset into smaller one, with as little loss of information as possible 1. if you want to visualize the data (in 2D/3D) 2. faster computation/less storage 3. reduce noise 18

Start Thinking About Project! What problems do you want to solve? Using what (large) datasets? What techniques do you need? 19