Data Mining. Jeff M. Phillips. January 7, 2019 CS 5140 / CS 6140

1 Data Mining CS 5140 / CS 6140 Jeff M. Phillips January 7, 2019

5 What is Data Mining?
  Finding structure in data?
  Machine learning on large data?
  Unsupervised learning?
  Large scale computational statistics?
  How to think about data analytics:
  Principles of converting from messy raw data to abstract representations.
  Algorithms for analyzing data in abstract representations.
  Addressing challenges in scalability, error, and modeling.

6 Modeling versus Efficiency
  Two intertwined (and often competing) objectives:
  Model data correctly
  Process data efficiently
  [Figure: Statistics, Data Mining, and Algorithms placed along an axis running from Modeling to Efficiency]

10 Other Data Mining Courses
  Every university teaches data mining differently!
  What flavor is offered in this class:
  Focus on techniques for very large scale data
  Broad coverage... with recent developments
  Formally and generally presented (proof sketches)... but useful in practice (e.g. internet companies)
  Probabilistic algorithms: connections to CS and Stat
  No specific software tools / programming languages
  Maths: Linear Algebra, Probability, High-dimensional geometry

11 Classic (Old) View of Data Mining

                         scalar outcome              set outcome
  labeled data (X, y)    regression                  classification    (prediction)
  unlabeled data X       dimensionality reduction    clustering        (structure)

12 Outline
  Statistical and Mathematical Principles:
    1. Hashing, Concentration of Measure
    2. Similarity (find duplicates and similar items)
  Structure in Data:
    3. Clustering (aggregate close items)
    4. Regression (linearity of high-d data, sparsity)
    5. Dimensionality Reduction (PCA, embeddings)
    7. Link Analysis (prominent structure in large graphs)
  Controlling for Noise and Uncertainty:
    6. Noisy Data (anomalies in data, ethics, privacy)

13 Statistical Principles
  What happens as data is generated with replacement from a domain such as {IP addresses, words in dictionary, edges in graph, hash table}?
  When do items collide?
  When do you see all items?
  When is the distribution almost uniform?
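The first two questions can be explored empirically. The sketch below is an illustration only, not part of the slides: it assumes items are drawn uniformly at random from a domain of size n, and the function names are hypothetical. It estimates how many draws it takes until the first repeat (the birthday-paradox setting, which grows roughly like sqrt(n)) and until every item has been seen (the coupon-collector setting, which grows roughly like n ln n).

```python
import random

def draws_until_collision(n, rng):
    """Draw uniformly from n items with replacement until some item repeats."""
    seen = set()
    draws = 0
    while True:
        item = rng.randrange(n)
        draws += 1
        if item in seen:
            return draws
        seen.add(item)

def draws_until_all_seen(n, rng):
    """Draw uniformly from n items with replacement until every item has appeared."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

def average(experiment, n, trials, seed=0):
    """Average an experiment over independent trials, using a fixed seed."""
    rng = random.Random(seed)
    return sum(experiment(n, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    n = 365  # assumed domain size, chosen only for illustration
    print("avg draws to first collision:", average(draws_until_collision, n, 2000))
    print("avg draws to see all items:  ", average(draws_until_all_seen, n, 200))
```

Running it for several values of n shows the gap between the two regimes: collisions appear long before the domain is exhausted.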

17 Raw Data to Abstract Representations
  How to measure similarity between data?
  Key idea: data → point
  [Figure: raw data examples: the text "a quick brown fox jumped", and a table of joe, bob, sue by age, income (e.g. 45K, 38K), height]
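As one possible instance of the "data → point" idea, text can be mapped to a set of word k-grams and two texts compared by Jaccard similarity. This is a minimal sketch under assumptions: the slide does not fix a representation or a similarity measure, the function names are hypothetical, and the second sentence in the usage example is made up for illustration.

```python
def word_kgrams(text, k=2):
    """Represent a text as the set of its consecutive k-word tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    s1 = word_kgrams("a quick brown fox jumped")
    s2 = word_kgrams("a quick brown dog jumped")  # hypothetical second text
    print(jaccard(s1, s2))
```

The two example texts share 2 of their 6 distinct word pairs, so the similarity is 1/3.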

18 Similarity
  Given a large set of data P and a new point q: is q in P?
  Given a large set of data P and a new point q: what is the closest point in P to q?
  [Figure: query point q and point set P]
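Both queries have simple baseline answers, sketched below under the assumption that points are tuples of numbers and that "closest" means Euclidean distance (the slide does not fix a distance). Membership uses a hash set; the nearest-point query is a brute-force linear scan, which is the baseline that techniques for very large data aim to beat. Function names are hypothetical.

```python
import math

def build_membership_index(P):
    """Hash the points once so that each membership query takes expected O(1) time."""
    return set(P)

def nearest_point(P, q):
    """Brute-force nearest neighbor: scan all of P, keep the point closest to q."""
    return min(P, key=lambda p: math.dist(p, q))

if __name__ == "__main__":
    P = [(0, 0), (3, 4), (1, 1)]
    index = build_membership_index(P)
    print((3, 4) in index)          # exact membership query
    print(nearest_point(P, (1, 2))) # closest point to q = (1, 2)
```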

20 Clustering
  How to find groups of similar data.
  Do we need a representative?
  Can groups overlap?
  What is the structure of the data/distance?
  Hierarchical clustering: when to combine groups?
  k-means clustering: k-median, k-center, k-means++
  Graph clustering: modularity, spectral
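For the k-means family, a common heuristic is Lloyd's algorithm: alternate between assigning each point to its nearest center and moving each center to the mean of its assigned points. The sketch below is an assumption-laden illustration, not the course's reference implementation: it uses Euclidean distance, picks initial centers uniformly at random from the data (not the k-means++ seeding named on the slide), and all names are hypothetical.

```python
import math
import random

def assign(points, centers):
    """Group points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[j].append(p)
    return clusters

def lloyds(points, k, iters=50, seed=0):
    """Lloyd's algorithm for k-means with random initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = assign(points, centers)
        new_centers = []
        for j, cluster in enumerate(clusters):
            if cluster:
                # Move the center to the coordinate-wise mean of its cluster.
                new_centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, assign(points, centers)

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, clusters = lloyds(pts, k=2)
    print(sorted(centers))
```

Lloyd's algorithm only finds a local optimum, so results on harder data depend on the initialization.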

22 Regression
  Consider a data set P ⊂ R^d, where d is BIG!
  Want to find a linear (or polynomial) function that represents P.
  [Figure: degree 3 fit]
  Least Squares: common easy approach (polynomial, high-dimensional)
  L1 Regression: sparser, generalizes better, Orthogonal Matching Pursuit
  Info Recovery: Compressed Sensing
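The least-squares idea is easiest to see in its smallest case: one explanatory variable and a straight line. The sketch below uses the standard closed form (slope = covariance / variance); it is an illustration only, the function name is hypothetical, and the high-dimensional and polynomial cases on the slide need a linear-algebra solver that is not shown here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ≈ slope * x + intercept for 1-D data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)  # must be nonzero: xs not all equal
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

if __name__ == "__main__":
    # Points lying exactly on y = 2x + 1, so the fit recovers that line.
    print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
```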

24 Dimensionality Reduction
  Again consider a data set P ⊂ R^d, where d is BIG!
  Want to find a linear subspace that represents P.
  [Figure: points a1 = (1.5, 0), a2 = (2, 0.5), a3 = (1, 1); A_k = U_k S_k V_k^T]
  SVD: linear algebra basis for PCA
  Multidimensional Scaling: fits sets of distances in R^k with k small
  Matrix Sketching: Random Projections, Sampling, FD
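To make the rank-1 case concrete, the sketch below finds the top right singular vector of a small matrix by power iteration on A^T A, using the three points from the figure as rows. This is an illustration under assumptions, not the method the slides prescribe: the data is not centered (so this is the SVD direction, not PCA proper), the all-ones start vector is assumed not to be orthogonal to the answer, and the function name is hypothetical.

```python
import math

def top_singular_pair(A, iters=100):
    """Power iteration on A^T A: returns (sigma_1, v_1) for a list-of-rows matrix A."""
    n, d = len(A), len(A[0])
    v = [1.0 / math.sqrt(d)] * d  # assumed start: not orthogonal to v_1
    for _ in range(iters):
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]     # A v
        w = [sum(A[i][j] * Av[i] for i in range(n)) for j in range(d)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    sigma = math.sqrt(sum(x * x for x in Av))  # sigma_1 = ||A v_1||
    return sigma, v

if __name__ == "__main__":
    A = [(1.5, 0.0), (2.0, 0.5), (1.0, 1.0)]  # rows a1, a2, a3 from the figure
    sigma, v = top_singular_pair(A)
    print(sigma, v)
```

Projecting each row onto the returned direction v gives the best one-dimensional linear representation of these points in the least-squares sense.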

25 Noisy Data
  What to do when data is noisy?
  Identify it: find and remove outliers
  Model it: it may be real, affect answer
  Exploit it: Differential privacy, Ethics of Data Science

26 Link Analysis, Graphs
  How does Google Search work?
  Converts webpage links into a directed graph.
  Markov Chains: models movement in a graph
  PageRank: how to convert a graph into important nodes
  MapReduce: how to scale up PageRank
  Communities: other important nodes in graphs
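One common textbook formulation of PageRank treats a random surfer as a Markov chain on the link graph and iterates the rank vector until it settles. The sketch below is a simplified illustration, not a description of any production search engine: the damping factor 0.85, the uniform handling of pages without out-links, the fixed iteration count, and the function name are all assumptions.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration for PageRank. `links` maps each node to its list of out-neighbors."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Every node receives the teleportation share first.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new_rank[v] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank

if __name__ == "__main__":
    # Hypothetical three-page web: a and b both link to c, and c links back to a.
    print(pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}))
```

In the example graph, page c collects links from both other pages and ends up with the largest rank.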

27 Summaries
  Reducing massive data to small space.
  Want to retain as much as possible (not specific structure), with error guarantees.
  OnePass Sampling: Reservoir Sampling
  MinCount Hash: sketching data, abstract features
  Density Approximation: Quantiles
  Matrix Sketching: preprocessing complex data
  Spanners: graph approximations
  [Figure labels: memory, CPU, length m, word ∈ [n]; u, ω(k, u), ω(p, u)]
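Reservoir sampling is the simplest of these summaries to write down: it keeps a uniform random sample of k items from a stream of unknown length in a single pass and O(k) memory. The sketch below follows the standard replacement rule (the i-th item, counting from 0, replaces a random slot with probability k / (i + 1)); the function name and the `rng` parameter are hypothetical conveniences.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """One-pass uniform sample of k items from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)     # fill the reservoir with the first k items
        else:
            j = rng.randrange(i + 1)   # uniform index in [0, i]
            if j < k:
                reservoir[j] = item    # keep the new item with probability k / (i + 1)
    return reservoir

if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), k=5, rng=random.Random(0)))
```

If the stream holds fewer than k items, the function simply returns all of them.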

28 Themes
  What are the course goals?
  Intuition for data analytics
  How to model data (convert to abstract data types)
  How to process data efficiently (balance models with algorithms)
