Data Mining Concepts & Tasks

Similar documents
Data Mining Concepts & Tasks

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

Data Integration. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Introduction to Machine Learning. Duen Horng (Polo) Chau Associate Director, MS Analytics Associate Professor, CSE, College of Computing Georgia Tech

Data Collection, Simple Storage (SQLite) & Cleaning

Text Analytics (Text Mining)

Text Analytics (Text Mining)

Classification Key Concepts

Analytics Building Blocks

CSE 6242 / CX 4242 Course Review. Duen Horng (Polo) Chau Associate Director, MS Analytics Assistant Professor, CSE, College of Computing Georgia Tech

Introduction to Data Mining and Data Analytics

Graphs / Networks CSE 6242/ CX Interactive applications. Duen Horng (Polo) Chau Georgia Tech

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig

Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

D B M G Data Base and Data Mining Group of Politecnico di Torino

Classification Key Concepts

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Graphs III. CSE 6242 A / CS 4803 DVA Feb 26, Duen Horng (Polo) Chau Georgia Tech. (Interactive) Applications

Machine Learning using MapReduce

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Jarek Szlichta

COMP90049 Knowledge Technologies

Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park

Based on Big Data: Hype or Hallelujah? by Elena Baralis

Data Mining Clustering

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 4. Prof. James She

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Basic Data Mining Technique

Graphs / Networks CSE 6242/ CX Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Knowledge Discovery and Data Mining

Chapter 1, Introduction

Data Mining Course Overview

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Data warehouse and Data Mining

Summary. Machine Learning: Introduction. Marcin Sydow

Epilog: Further Topics

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Machine Learning - Clustering. CS102 Fall 2017

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Data Mining Algorithms

Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

Mining Social Media Users Interest

Introduction. Welcome. Machine Learning

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

Data Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Exploratory Analysis: Clustering

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

Applying Supervised Learning

CSE4334/5334 DATA MINING

Knowledge Discovery and Data Mining 1 (VO) ( )

MATH36032 Problem Solving by Computer. Data Science

Knowledge Discovery and Data Mining

Mining Web Data. Lijun Zhang

Data Mining and Analytics. Introduction

COMP6237 Data Mining Making Recommendations. Jonathon Hare

Preprocessing DWML, /33

Text Analytics (Text Mining)

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

DATA MINING II - 1DL460

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

Text Analytics (Text Mining)

3 Data, Data Mining. Chengkai Li

Data Mining Concepts

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Clustering Basic Concepts and Algorithms 1

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

COMP 465 Special Topics: Data Mining

Fall 2018 CSE 482 Big Data Analysis: Exam 1 Total: 36 (+3 bonus points)

Hybrid Recommendation System Using Clustering and Collaborative Filtering

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Visualization and text mining of patent and non-patent data

USC Viterbi School of Engineering

Fall 2017 ECEN Special Topics in Data Mining and Analysis

Stats Overview Ji Zhu, Michigan Statistics 1. Overview. Ji Zhu 445C West Hall

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Administrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES

Data Warehousing and Machine Learning

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

Chapter 3. Foundations of Business Intelligence: Databases and Information Management

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Data Platforms and Pattern Mining

Table Of Contents: xix Foreword to Second Edition

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

COMP 465: Data Mining Classification Basics

data-based banking customer analytics

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

Transcription:

Data Mining Concepts & Tasks Duen Horng (Polo) Chau Georgia Tech CSE6242 / CX4242 Sept 9, 2014 Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

Last Time Collection Cleaning Integration Analysis Visualization Presentation Dissemination Data Cleaning Google Refine, Data Wrangler Data Integration Many examples: Google knowledge graph, Facebook Graph Search, Freebase, Feldspar, Kayak, Apple Siri, etc. 2

Continuing with Data Integration

Freebase (a graph of entities)" a large collaborative knowledge base consisting of metadata composed mainly by its community members Wikipedia. 4

Crowd-sourcing Approaches: Freebase http://wiki.freebase.com/wiki/what_is_freebase%3f 5

What do we need before we can even integrate datasets/tables/schemas? 6

What do we need before we can even integrate datasets/tables/schemas? You need an ID for every unique entity/item/object/thing Easy? 7

What do we need before we can even integrate datasets/tables/schemas? person_id name state_id 1 Smith 111 2 Johnson 222 3 Obama 222 + state_id state_name 111 GA 222 NY 333 CA person_id name state 1 Smith GA 2 Johnson NY 3 Obama NY 8

Entity Resolution (A hard problem in data integration) Polo Chau P. Chau Duen Horng Chau Duen Chau D. Chau 9

Why is Entity Resolution so Important?

D-Dupe Interactive Data Deduplication and Integration TVCG 2008 University of Maryland Bilgic, Licamele, Getoor, Kang, Shneiderman http://linqs.cs.umd.edu/basilic/web/publications/2008/kang:tvcg08/kang-tvcg08.pdf http://www.cs.umd.edu/projects/linqs/ddupe/ (skip to 0:55) 12

Polo Poalo

Numerous similarity functions Euclidean distance Euclidean norm / L2 norm Manhattan distance Jaccard Similarity e.g., overlap of nodes #neighbors " Excellent read: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf String edit distance e.g., Polo Chau vs Polo Chan Many more 15

Core components: Similarity functions Determine how two entities are similar. D-Dupe s approach: Attribute similarity + relational similarity Similarity score for a pair of entities 16

Attribute similarity (a weighted sum) 17

Summary for data integration Opportunities enable new services (Siri, padmapper) enable new ways to discover info improve existing services reduce redundancy new way to interactive with data promote knowledge transfer (e.g., between companies) 18

Data Mining Concepts & Tasks Collection Cleaning Integration Analysis Visualization Presentation Dissemination Each data-driven (business, decision-making) problem is unique, e.g., different goals, constraints. " Good news: many (sub)tasks that underlie these problems are common " Here is an overview of the common tasks, based on Data Science for Business: What you need to know about data mining and data-analytic thinking " 19

http://www.amazon.com/data-science-business-data-analytic-thinking/dp/1449361323

1. (soft) Classification, Probability Estimation (supervised learning) Predict which of a (small) set of classes an entity belong to. Examples: Is this app malicious or benign? Will this customer click on this ad? More Examples? payment transaction -> fraudulent? news/emails -> spam? tumor -> benign? sentiment analysis -> +, -, neutral weather -> rain, storm, sunny movies genres -> action, etc. friends -> close, acquaintance, etc. online dating -> will work out or not? surveillance system -> suspicious or not 21

2. Regression ( value estimation ) (supervised learning) Predict the numerical value of some variable for an entity. Example: how much minutes will this cellphone customer use? Related to classification, but predict how much, instead of discrete decisions (e.g., yes, no) More Examples? stock prices price of plane tickets weather prediction credit scores time until machine fails (data center) inventory management (supply chain) population change (city, population planning) sports stat (gambling) 22

3. Similarity Matching Find similar entities (from a large dataset) based on what we know about them. Examples? Online dating recommendation systems (similar songs, movies) image classifier (find all sunset images) suggestions for online shopping market segmentation suggestion of friends on facebook online advertisement -> restaurant classification (italian, Chinese) search results (google similar results) search query matching 23

4. Clustering (unsupervised learning) Group entities together by their similarity. (User provides # of clusters) Examples? factors for diseases movie categories (genres; soft clustering) market segmentation for targeted advertisement social network analysis (whether people like the same thing) geographical data (identify neighborhood, popular landmarks) 24

5. Co-occurrence grouping (Many names: frequent itemset mining, association rule discovery, market-basket analysis) Find associations between entities based on transactions that involve them (e.g., bread and milk often bought together) 25

6. Profiling / Pattern Mining / Anomaly Detection (unsupervised) Characterize typical behaviors of an entity (person, computer router, etc.) so you can find trends and outliers. Examples? computer instruction prediction removing noise from experiment (data cleaning) detect anomalies in network traffic moneyball weather anomalies (e.g., big storm) google sign-in (alert) smart security camera embezzlement trending articles 26

7. Link Prediction / Recommendation Predict if two entities should be connected, and how strongly that link should be. Examples? two people on Facebook amazon (things bought together); asssociation-rule mining netflix: recommend jim carey movie related questions on quora top apps on apple store crime group detection (bad guys on social network) google search suggestions 27

8. Data reduction ( dimensionality reduction ) Shrink a large dataset into smaller one, with as little loss of information as possible When to do it? Examples? Why do it? Original data is too big -> too hard to process, or take too long 2D -> 1D (many Ds -> few Ds): for visualization, for more efficient algorithms Graph partitioning - split a large graph into smaller subgraphs 28

Start thinking about project What kind of datasets and problems do you want to solve? What techniques do you need? Will describe project requirements in next lecture 29