Fall 2017 ECEN Special Topics in Data Mining and Analysis

Similar documents
ECEN 689 Special Topics in Data Science for Communications Networks

Introduction to Data Mining and Data Analytics

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC

Data Mining Course Overview

Instructor: Dr. Mehmet Aktaş. Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

USC Viterbi School of Engineering

Jarek Szlichta

Clustering Analysis Basics

D B M G Data Base and Data Mining Group of Politecnico di Torino

Intro to Artificial Intelligence

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Based on Big Data: Hype or Hallelujah? by Elena Baralis

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Knowledge Discovery and Data Mining

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

Basic Data Mining Technique

Summary. Machine Learning: Introduction. Marcin Sydow

ECS289: Scalable Machine Learning

: Semantic Web (2013 Fall)

Spatial Analytics Built for Big Data Platforms

Fall 2018 CSE 482 Big Data Analysis: Exam 1 Total: 36 (+3 bonus points)

Discovery of Agricultural Patterns Using Parallel Hybrid Clustering Paradigm

Introduction to Data Mining

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

DS504/CS586: Big Data Analytics Big Data Clustering II

Domestic electricity consumption analysis using data mining techniques

Statistics (STAT) Statistics (STAT) 1. Prerequisites: grade in C- or higher in STAT 1200 or STAT 1300 or STAT 1400

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Network Traffic Measurements and Analysis

Big Data Analytics CSCI 4030

An Overview of Smart Sustainable Cities and the Role of Information and Communication Technologies (ICTs)

Part I: Data Mining Foundations

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

CHAPTER 5 OBJECT ORIENTED IMAGE ANALYSIS

COMP 465 Special Topics: Data Mining

Data Mining Concepts & Tasks

Data: a collection of numbers or facts that require further processing before they are meaningful

Table Of Contents: xix Foreword to Second Edition

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Chapter 6. Foundations of Business Intelligence: Databases and Information Management VIDEO CASES

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Smart Cities & The 4th Industrial Revolution

CS246: Mining Massive Datasets Jure Leskovec, Stanford University


STATISTICS (STAT) Statistics (STAT) 1

Pre-Requisites: CS2510. NU Core Designations: AD

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

BIG DATA SCIENTIST Certification. Big Data Scientist

Data Mining Concepts & Tasks

DATA MINING II - 1DL460

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Chapter 1, Introduction

CSE4334/5334 DATA MINING

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Sensor Based Time Series Classification of Body Movement

Contents. Foreword to Second Edition. Acknowledgments About the Authors

SCHEME OF COURSE WORK. Data Warehousing and Data mining

10-701/15-781, Fall 2006, Final

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Homework # 4. Example: Age in years. Answer: Discrete, quantitative, ratio. a) Year that an event happened, e.g., 1917, 1950, 2000.

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

INCREASING CLASSIFICATION QUALITY BY USING FUZZY LOGIC

745: Advanced Database Systems

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA

Data warehouse and Data Mining

The Gain setting for Landsat 7 (High or Low Gain) depends on: Sensor Calibration - Application. the surface cover types of the earth and the sun angle

DS504/CS586: Big Data Analytics Big Data Clustering II

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

Where we are. Exploratory Graph Analysis (40 min) Focused Graph Mining (40 min) Refinement of Query Results (40 min)

Data Preprocessing. Data Preprocessing

Supervised and Unsupervised Learning (II)

COLLEGE OF ENGINEERING COURSE AND CURRICULUM CHANGES. October 19, Rathbone Hall. 3:30pm. Undergraduate/Graduate EXPEDITED

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

Record Linkage using Probabilistic Methods and Data Mining Techniques

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

k-means Gaussian mixture model Maximize the likelihood exp(

Spatial Analysis and Modeling (GIST 4302/5302) Guofeng Cao Department of Geosciences Texas Tech University

Data Science Bootcamp Curriculum. NYC Data Science Academy

Land Cover Classification Techniques

Yunfeng Zhang 1, Huan Wang 2, Jie Zhu 1 1 Computer Science & Engineering Department, North China Institute of Aerospace

Massive Data Analysis

Business Analytics and Big Data: the process and the tools

The k-means Algorithm and Genetic Algorithm

Keyword Extraction by KNN considering Similarity among Features

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

3.Data Abstraction. Prof. Tulasi Prasad Sariki SCSE, VIT, Chennai 1 / 26

Clustering. Huanle Xu. Clustering 1

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

University of Florida CISE department Gator Engineering. Clustering Part 5

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

Preface MOTIVATION ORGANIZATION OF THE BOOK. Section 1: Basic Concepts of Graph Theory

Transcription:

Fall 2017 ECEN 689-600 Special Topics in Data Mining and Analysis Nick Duffield Department of Electrical & Computer Engineering Teas A&M University

Organization

Organization Instructor: Nick Duffield, Professor ECE Contact: duffieldng@tamu.edu ; (979) 845-7328 Course materials on Google Team Drive Access for registered students, must be logged into tamu.edu account Class times: MWF 3:00-3:50pm, ETB 1035 Office hours: WEB 332D, Mon11:00am; Tue 1:00pm Prerequisites: some previous course involving probability or statistics is helpful Grading: Assignments 50%; Tests 20%; Project 10%; Final Eam 20% Discussion of assignments is encouraged, but copying is not allowed. Assignments must be handed in on time to receive full credit

Eternal course materials Primary course tet [ZM] Data Mining and Analysis: Fundamental Concept and Algorithms, M. Zaki & W. Meira, Jr. Cambridge University Press, 2014, http://www.dataminingbook.info/pmwiki.php, Book, slides, datasets Book available through TAMU Library http://proquest.safaribooksonline.com/9781107779105 Secondary course tet [MMDS] Mining of Massive Datasets, J. Leskovec, A. Rajaraman, & J. Ullman http://mmds.org/ Assignments Selected eamples from or based on [ZM] Selected data projects from [ZM] Typically applying or implementing computation Solutions = code + numerical results

Objectives of the course Description Overview of data mining, integrating related concepts from machine learning and statistics. Fundamental topics including eploratory data analysis, pattern mining, clustering, and classification, with applications to scientific and online data. Learning outcomes Acquire knowledge of foundations and application of methods in data mining and data analysis. Prepare students to use methods and tools of data science in research, whether focused on methods or on applications.

Course Topics Week Topic Required Reading 1 Data and Attributes ZM Chapters 2, 3 2 Dimensionality Reduction ZM Chapter 7 3 Frequent Itemset Mining & ZM Chapter 8, MMDS Chap. 6 Association Rules 4 Representative Clustering MMDS, Chapter 7 5 Gaussian Miture Clustering & ZM Chapter 13 EM Method 6 Hierarchical Clustering ZM Chapter 14 7 Density Estimation & Density- ZM Chapter 15 Based Clustering 8 Bayesian & Nearest Neighbor ZM Chapter 18 Classification 9 Decision Tree Classification ZM Chapter 19 10 PageRank, Graphs & Search MMDS Chapters 5 & 9 11 Recommendation Systems MMDS Chapter 9 12 Community Detection MMDS Chapter 10 13 Perceptrons & Boosting MMDS Chapter 11 14 Support Vector Machines MMDS Chapter 11

About me Joined Teas A&M in August 2014 Worked for 18 years in AT&T Labs-Research in New Jersey Research in data science for communications networks Previously Asst. Professor in Europe Undergrad/PhD in Physics and Mathematical Physics Research Interests: Measurement and analysis of communications networks Measurements in software defined networks Network security / cybersecurity Big Data Science Streaming algorithms Statistical Inference, Machine Learning Applications in Transportation, Geosciences

A Quick Tour of the Course

A Quick Tour of the Course What is driving the recent growth and abundance of data? What does data look like? What questions are we trying to answer? Eploratory data analysis Dealing with high dimensional data Finding patterns in the data: clustering Learning and classification

Drivers for Recent Rapid Growth in Data Instrumentation of the internet & internet-based services Operational data from online social networks Billions of users; trillions of connections; Search logs at search providers such as Google Internet Service Providers Traffic measurements: #bytes, #packets between network endpoint pairs Internet of Things Huge increase in number of connected endpoints Retail transactions Online retailers, movie purchases

Drivers for Recent Rapid Growth in Data (2) Increased Sensing in Science, Health, Engineering Satellite Imaging Land use, elevation, slope, soil type & moisture, vegetation, radiance Unmanned Aerial Vehicles Agricultural imaging down to scale of individual plants Weather & Climate Wind velocity, precipitation, humidity, temperature, Crowdsourcing (Internet connected weather stations) Ocean sensing: temperature, currents Transportation Location from apps, cell-towers, vehicle state from smart cars Health & Medicine Genetic sequencing, phenotypes, clinical records, treatments, outcomes, environment & economics, lifestyle Energy and Utilities Seismological data from oil eploration Infrastructure sensors, household smart meters

General Properties of Data Most data in this course is in form of a set of records E.g. Data matri D: Each row = record containing values of d of attributes Such as measured values from n individuals in a survey E.g. each record is a list of items E.g. list of items purchased in a transaction These are eamples of structured data Organized data, e.g. in database with specified field formats Information readily searched / retrieved / inserted Contrast with unstructured data E.g. audio files, images, videos, free tet

How to make model of the data? Data records = (age, height, weight, ) from clinical surveys of children What is relation between variables? Model distribution of height and weight as a function of age Used to make clinical growth charts Eploratory data analysis For each age, characterize mean, variance, covariance amongth height, weight, Multivariate Gaussian model

Find patterns in the data Dataset = retail transactions, each listing items purchase These are called itemsets What are the most frequently purchased items? Which items are frequently purchased together? Information used to advertising, special offers, recommendations If item A more often bought with item B that without, market together How do we find all the frequent itemsets of a given size?

The Challenge of Data Size Datasets may be inherently large Data matri with large number n of records See previous eamples Computations can have high possible size n**k possible itemsets of size k built from n objects Abstract specification ( find frequent itemsets ) often needs careful implementation to compute efficiently

Challenges of High Data Dimensionality Data may have large number of attributes Data matri with many columns d Challenges Models have large number of parameter d(d-1)/2 variances / covariances in a Gaussian model Cumbersome, more difficult to estimate well Dimension reduction Identify small number of attributes (or combinations of attributes) that account for most of the data variation Other dimensions of the data are viewed as noise,

Clustering In many data sets, the points evidently disjoin into groups Different subpopulations within data, each with own data characteristic Want to automate identification of clusters in data Different notions of clustering Representative: cluster points close to a central representative Density based: separated groups of contiguous points Hierarchical clustering: grouping at multiple levels

Classification Clustering is type of unsupervised machine learning We have no side information on variables that may distinguish clusters Classification For each data point we are given a class y Eample: data point = (height, length) of plant leaf class y = plant variety Want to learn the relationship between the data values and the class A classifier is a function that can be used to predict the class of any possible data point Supervised machine learning: Compute a classifier from set of (data, class) values {( i, y i ): i=1,,n} Method include: Bayesian, Support Vector Machines, Linear Perceptrons, Boosting

Graphical Data Mining Many interesting dataset can be represented as graphs Eample: web graph Verte = web page Directed edge (a,b) iff page a contains a hyperlink to edge b Web search problem: how to rank pages by importance Content based criteria: is page content relevant for search, e.g. matching keywords Topology based criteria: Are there many hyperlinks to that page? Google s PageRank algorithms based on random walk of web graph