CS573 Data Privacy and Security. Differential Privacy tabular data and range queries. Li Xiong
1 CS573 Data Privacy and Security Differential Privacy tabular data and range queries Li Xiong
2 Outline
- Tabular data and histogram/range queries
- Algorithms for low dimensional data
- Algorithms for high dimensional data
3 Example: cohort discovery from medical records
- Histograms
- Cohort discovery: range queries
  SELECT COUNT(*) FROM D WHERE A1 IN I1 AND A2 IN I2 AND ... AND Am IN Im
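As a rough illustration of the cohort-discovery query on slide 3, the sketch below counts the records that satisfy every interval predicate. The function name `range_count`, the closed-interval convention, and the sample records are assumptions made for this example; they are not taken from the lecture.

```python
def range_count(records, intervals):
    """Count records whose j-th attribute lies in the closed interval intervals[j]."""
    return sum(
        all(lo <= rec[j] <= hi for j, (lo, hi) in enumerate(intervals))
        for rec in records
    )

# Hypothetical (age, income in $K) records, made up for illustration.
records = [(42, 30), (31, 60), (28, 20), (55, 45)]

# SELECT COUNT(*) FROM D WHERE age IN [25, 45] AND income IN [20, 50]
print(range_count(records, [(25, 45), (20, 50)]))  # 2
```

This is the exact, non-private answer; the algorithms on the later slides approximate it under differential privacy.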
4 Example: statistical agencies: data publishing
- A marginal over attributes A_1, ..., A_k reports a count for each combination of attribute values (also known as a cube or contingency table).
- E.g. a 2-way marginal on EmploymentStatus and Gender.
- U.S. Census Bureau statistics can typically be derived from k-way marginals over different combinations of available attributes.
- Hundreds of marginals are released.
Module 3 Tutorial: Differential Privacy in the Wild
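A marginal can be sketched in a few lines. The helper name `marginal` and the four records below are hypothetical; only the attribute names EmploymentStatus and Gender come from the slide.

```python
from collections import Counter

def marginal(records, attrs):
    """Count records for each combination of values of the chosen attributes."""
    return Counter(tuple(rec[a] for a in attrs) for rec in records)

# Made-up records for illustration only.
records = [
    {"EmploymentStatus": "employed", "Gender": "F", "Age": 34},
    {"EmploymentStatus": "employed", "Gender": "M", "Age": 51},
    {"EmploymentStatus": "unemployed", "Gender": "F", "Age": 29},
    {"EmploymentStatus": "employed", "Gender": "F", "Age": 45},
]

two_way = marginal(records, ["EmploymentStatus", "Gender"])
print(two_way[("employed", "F")])  # 2
```

Combinations that never occur (here unemployed/M) have count 0, which a private release would still need to perturb.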
5 Example: range queries over spatial data
- Input: sensitive data D (scatter plot of input data).
- Input: range query workload W (shown is a workload of 3 range queries).
- BeijingTaxi dataset [1]: 4,268,780 records of (lat, lon) pairs of taxi pickup locations in Beijing, China, in 1 month.
- Task: compute answers to workload W over private input D.
[1] Raw data from: Taxi trajectory open dataset, Tsinghua University, China.
6 Problem variant: offline vs. online
- Offline (batch): the entire workload W is given as input and the answers are computed in batch.
- Online (adaptive): W is a sequence q_1, q_2, ... that arrives online.
- Adaptive: the analyst's choice of q_i can depend on the earlier answers a_1, ..., a_{i-1}.
7 Important aspects of problem: data and query complexity
- Data complexity
  - Dimensionality: number of attributes
  - Domain size: number of distinct attribute combinations
  - Many techniques are specialized for low dimensional data
- Query complexity
  - Given query workload vs. no query workload
  - Classes of queries: histograms, count queries, linear queries (sum, average), median
8 Solution variants: query answers vs. synthetic data
Two high-level approaches to solving the problem:
1. Direct: the output of the algorithm is a list of query answers.
2. Synthetic data: the algorithm constructs a synthetic dataset D', which can be queried directly by the analyst. The analyst can pose additional queries on D' (though the answers may not be accurate).
9 Synthetic Data: Categories of Methods
- Nonparametric methods release empirical distributions, i.e. histograms, with differential privacy.
- Parametric and semi-parametric methods learn the parameters of a distribution with differential privacy.
10 Outline
- Tabular data and histogram/range queries
- Algorithms for low dimensional data
  - Baseline
  - Partitioning algorithms: kd-tree, quad-tree, ...
  - Transformation: Wavelet, Fourier Transform, ...
  - An evaluation framework: DPBench
- Algorithms for high dimensional data
11 Baseline algorithm: IDENTITY
[Figure: scatter plot of input data]
1. Discretize the attribute domain into cells.
2. Add noise to the cell counts (Laplace mechanism): the unit histogram.
3. Use the noisy counts to either
   1. answer queries directly (assume the distribution is uniform within each cell), or
   2. generate synthetic data (derive a distribution from the counts and sample).
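The three steps can be sketched for one-dimensional data as follows. This is a minimal sketch, not the course's reference solution: the function names, the half-open query interval, and the choice of sensitivity 1 (neighboring datasets differ by adding or removing one record) are assumptions. Laplace noise is drawn as the difference of two exponential variates so that only the standard library is needed.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def identity_histogram(values, lo, hi, n_cells, epsilon, rng):
    """Steps 1 and 2: discretize [lo, hi] into equal cells, add Lap(1/epsilon) per cell."""
    width = (hi - lo) / n_cells
    counts = [0] * n_cells
    for v in values:
        counts[min(int((v - lo) / width), n_cells - 1)] += 1
    # Assumed sensitivity 1: one added or removed record changes one cell by 1.
    return [c + laplace(1.0 / epsilon, rng) for c in counts]

def answer_range(noisy, lo, hi, a, b):
    """Step 3.1: estimate the count in [a, b), assuming uniformity inside each cell."""
    width = (hi - lo) / len(noisy)
    total = 0.0
    for i, c in enumerate(noisy):
        cell_lo = lo + i * width
        overlap = max(0.0, min(b, cell_lo + width) - max(a, cell_lo))
        total += c * overlap / width
    return total

rng = random.Random(0)
values = [0.5, 1.5, 1.6, 2.5, 3.5, 3.6, 3.7]
noisy = identity_histogram(values, 0.0, 4.0, n_cells=4, epsilon=1.0, rng=rng)
print(answer_range(noisy, 0.0, 4.0, 1.0, 4.0))  # noisy estimate of the true count 6
```

Individual noisy counts can be negative; whether to clamp them is a post-processing choice that this sketch leaves open.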
12 Baseline algorithm: IDENTITY
Limitations:
- Granularity of discretization
  - Coarse: detail is lost
  - Fine: noise overwhelms the signal
- Noise accumulates: squared error grows linearly with the range
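The linear growth can be made explicit. Assuming each cell receives independent Laplace noise of scale 1/ε (sensitivity 1, which depends on the neighboring-dataset definition), a range query q that sums k whole cells has the following expected squared error:

```latex
\tilde{c}_i = c_i + \eta_i, \qquad \eta_i \sim \mathrm{Lap}(1/\varepsilon), \qquad \mathrm{Var}(\eta_i) = \frac{2}{\varepsilon^2}

\mathbb{E}\big[(\tilde{q} - q)^2\big] = \sum_{i=1}^{k} \mathrm{Var}(\eta_i) = \frac{2k}{\varepsilon^2}
```

Halving the cell width doubles k for the same range, which is the "fine: noise overwhelms the signal" side of the trade-off.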
13 Empirical benchmarks [HMMCZ16]
- An evaluation framework for standardized evaluation of privacy algorithms for range queries (over 1D and 2D data)
- Demo:
14 Data-Dependent Partitioning
- Domain-based (data-independent) partitioning does not work very well
  - Equi-width: equal bucket range
  - Relies on the uniformity assumption
- Data-driven partitioning
  - V-optimal: the partitioning with the least frequency variance
  - Intuition: highest uniformity within each bucket
- How to do it with differential privacy?
October 2,
15 Histograms (review)
- Divide the data into buckets and store the average (sum) for each bucket
- Partitioning rules:
  - Equi-width: equal bucket range
  - Equi-depth: equal frequency
  - V-optimal: the least frequency variance
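The first two partitioning rules can be contrasted with a small non-private sketch. The function names and the sample values are assumptions for illustration; V-optimal partitioning needs a dynamic program and is not shown.

```python
def equi_width_edges(values, k):
    """k buckets covering equal ranges of the attribute."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k + 1)]

def equi_depth_edges(values, k):
    """k buckets holding (roughly) equal numbers of values."""
    s = sorted(values)
    n = len(s)
    return [s[min(i * n // k, n - 1)] for i in range(k + 1)]

values = [1, 2, 3, 4, 5, 6, 7, 100]
print(equi_width_edges(values, 4))  # [1.0, 25.75, 50.5, 75.25, 100.0]
print(equi_depth_edges(values, 4))  # [1, 3, 5, 7, 100]
```

With the skewed sample, equi-width puts seven of the eight values in the first bucket, while equi-depth adapts its edges to the data. That adaptation is exactly what makes data-dependent partitioning sensitive and hard to do privately.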
16 An Early Attempt: DPCube [SDM 2010, ICDE 2012 demo]
Original records:
Name   Age  Income  HIV+
Frank  42   30K     Y
Bob    31   60K     Y
Mary   28   20K     Y
Steps:
1. Compute the unit histogram with differential privacy (ε/2-DP).
2. Partition the DP unit histogram with a kd-tree (multi-dimensional partitioning).
3. Compute the merged bin counts with differential privacy (ε/2-DP), giving the DP V-optimal histogram.
[Figure: original records → DP interface → DP unit histogram → multi-dimensional partitioning → DP interface → DP V-optimal histogram]
17 kd-tree based partitioning
- Choose the dimension and splitting point to split on (minimize variance)
- Repeat until:
  - the count of this node is less than a threshold, or
  - the variance or entropy of this node is less than a threshold
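A sketch of this recursion over a 2-D grid of cell counts is given below. It assumes the within-node sum of squared deviations as the variance measure and tries every axis-aligned cut; the names `sse`, `kd_partition`, `min_count` and `max_sse` are hypothetical. In the DPCube setting the grid would hold the noisy DP unit histogram rather than raw counts, so the partitioning itself is post-processing.

```python
def sse(grid, r0, r1, c0, c1):
    """Sum of squared deviations of the cell counts in grid[r0:r1][c0:c1] from their mean."""
    cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    mean = sum(cells) / len(cells)
    return sum((x - mean) ** 2 for x in cells)

def kd_partition(grid, r0, r1, c0, c1, min_count, max_sse):
    """Return leaf rectangles (r0, r1, c0, c1) of a variance-minimizing kd-tree."""
    total = sum(grid[r][c] for r in range(r0, r1) for c in range(c0, c1))
    # Stop when the node is small enough or already close to uniform.
    if total < min_count or sse(grid, r0, r1, c0, c1) <= max_sse:
        return [(r0, r1, c0, c1)]
    cuts = [("row", r) for r in range(r0 + 1, r1)]
    cuts += [("col", c) for c in range(c0 + 1, c1)]
    if not cuts:  # a single cell cannot be split further
        return [(r0, r1, c0, c1)]

    def cost(cut):
        axis, k = cut
        if axis == "row":
            return sse(grid, r0, k, c0, c1) + sse(grid, k, r1, c0, c1)
        return sse(grid, r0, r1, c0, k) + sse(grid, r0, r1, k, c1)

    axis, k = min(cuts, key=cost)
    if axis == "row":
        return (kd_partition(grid, r0, k, c0, c1, min_count, max_sse)
                + kd_partition(grid, k, r1, c0, c1, min_count, max_sse))
    return (kd_partition(grid, r0, r1, c0, k, min_count, max_sse)
            + kd_partition(grid, r0, r1, k, c1, min_count, max_sse))

grid = [[10, 10, 0, 0],
        [10, 10, 0, 0]]
print(kd_partition(grid, 0, 2, 0, 4, min_count=1, max_sse=0))
# [(0, 2, 0, 2), (0, 2, 2, 4)]
```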
18 DPCube [SecureDM 2010, ICDE 2012 demo]
Limitations:
- The DP unit histogram is very noisy, which affects the accuracy of the partitioning.
- Sequential composition: the budget is shared between the two ε/2-DP steps.
[Figure: original records → DP interface → DP unit histogram → multi-dimensional partitioning → DP interface → DP V-optimal histogram]
19 A Later Improvement: Private Spatial Decompositions [CPSSY12]
[Figure: quadtree and kd-tree]
- Approach: (top-down) partitioning with differential privacy
- Quadtree and hybrid/kd-tree
20 Building a Private kd-tree
Process to build a private kd-tree. Input: maximum height h, minimum leaf size L, data set.
- Choose the dimension to split.
- Get the (private) median in this dimension.
- Create child nodes and add noise to the counts.
- Recurse until:
  - the maximum height is reached, or
  - the noisy count of this node is less than L, or
  - the budget along the root-leaf path has been used up.
- The entire PSD satisfies DP by the composition property.
21 Building a Private kd-tree
Process to build a private kd-tree. Input: maximum height h, minimum leaf size L, data set.
- Choose the dimension to split.
- Get the (private) median in this dimension: exponential mechanism with utility function u(x) = -|rank(x) - rank(median)|.
- Create child nodes and add noise to the counts.
- Recurse until:
  - the maximum height is reached, or
  - the noisy count of this node is less than L, or
  - the budget along the root-leaf path has been used up.
- The entire PSD satisfies DP by the composition property.
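The private-median step can be sketched with the exponential mechanism over a fixed, data-independent list of candidate split points. Everything named here (`private_median`, `candidates`, the use of `bisect_left` as the rank, and a utility sensitivity of 1) is an assumption for illustration rather than the exact construction of [CPSSY12].

```python
import bisect
import math
import random

def private_median(values, candidates, epsilon, rng):
    """Sample a split point with probability proportional to exp(epsilon * u(x) / 2),
    where u(x) = -|rank(x) - n/2| and rank(x) counts the values below x."""
    s = sorted(values)
    target = len(s) / 2.0
    scores = [-abs(bisect.bisect_left(s, x) - target) for x in candidates]
    top = max(scores)  # shift scores for numerical stability; the distribution is unchanged
    # Assumed sensitivity of u is 1, hence the factor 1/2 in the exponent.
    weights = [math.exp(epsilon * (u - top) / 2.0) for u in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(7)
ages = [23, 25, 31, 38, 42, 47, 55, 61, 70]
candidates = list(range(0, 101, 5))  # public grid of possible split points
print(private_median(ages, candidates, epsilon=0.5, rng=rng))
```

Candidates whose rank is far from n/2 are exponentially less likely to be chosen, so a larger ε concentrates the output near the true median.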
22 Building Private Spatial Decompositions: privacy budget allocation
- The budget is split between medians and counts at each node: trade off the accuracy of the division against the accuracy of the counts.
- The budget is split across the levels of the tree: the privacy budget used along any root-leaf path should total ε (sequential composition along a path, parallel composition across the disjoint nodes of one level).
- Optimal budget allocation
- Post-processing with a consistency check
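The "total ε along any root-leaf path" constraint can be illustrated with a small allocation helper. This is a sketch under assumptions: the function names are hypothetical, and the geometric ratio 2^(1/3) per level is used here only as an illustrative default that gives deeper levels more budget; consult [CPSSY12] for the allocation actually analysed there.

```python
def uniform_budget(epsilon, height):
    """Equal share for each of the height + 1 levels (root = level 0)."""
    return [epsilon / (height + 1)] * (height + 1)

def geometric_budget(epsilon, height, ratio=2 ** (1 / 3)):
    """Shares that grow by `ratio` per level from root to leaves and sum to epsilon."""
    raw = [ratio ** level for level in range(height + 1)]
    scale = epsilon / sum(raw)
    return [scale * r for r in raw]

levels = geometric_budget(epsilon=1.0, height=4)
print([round(e, 3) for e in levels])
print(sum(levels))  # equals epsilon up to floating-point rounding
```

Because the nodes on one level cover disjoint regions, each level costs only its own share (parallel composition), and the shares add up along a path (sequential composition).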
23 Data-dependent partitioning
- Heuristics-based methods: kd-tree, quad-tree
- Optimal methods: V-optimal histogram (1D or 2D)
24 Data-aware/Workload-Aware Mechanism [LHMY14]
- Step 1: dynamic programming based methods for optimal partitioning
- Step 2: matrix mechanism for optimal noise given a query workload
25 Data Transformations
- Can think of trees as a data-dependent transform of the input
- Can apply other data transformations
- General idea:
  - Apply a transform to the data
  - Add noise in the transformed space (based on sensitivity)
  - Publish the noisy coefficients, or invert the transform (post-processing)
- Goal: pick a transform that preserves good properties of the data and has low sensitivity, so the noise does not corrupt the result
[Figure: original data → transform → coefficients → add noise → noisy coefficients → invert → private data]
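The transform, noise, invert pipeline can be sketched with an orthonormal Haar wavelet on a 1-D histogram whose length is a power of two. This is a simplified illustration, not the weighted scheme of Privelet [XWG10]: it adds the same Laplace scale to every coefficient, calibrated to the L1 sensitivity of the transform (the largest column L1 norm, assuming one record changes one count by 1). All function names are hypothetical.

```python
import math
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def haar_matrix(n):
    """Rows of the orthonormal Haar transform for n = 2^k."""
    rows = [[1.0 / math.sqrt(n)] * n]
    size = n
    while size > 1:
        half = size // 2
        for start in range(0, n, size):
            row = [0.0] * n
            for i in range(start, start + half):
                row[i] = 1.0 / math.sqrt(size)
            for i in range(start + half, start + size):
                row[i] = -1.0 / math.sqrt(size)
            rows.append(row)
        size = half
    return rows

def l1_sensitivity(matrix):
    """Largest L1 change of the coefficients when one input count changes by 1."""
    return max(sum(abs(row[j]) for row in matrix) for j in range(len(matrix[0])))

def noisy_transform_release(counts, epsilon, rng):
    h = haar_matrix(len(counts))
    coeffs = [sum(w * c for w, c in zip(row, counts)) for row in h]
    scale = l1_sensitivity(h) / epsilon
    noisy = [c + laplace(scale, rng) for c in coeffs]
    # The matrix is orthonormal, so its inverse is its transpose (post-processing).
    return [sum(h[r][j] * noisy[r] for r in range(len(h))) for j in range(len(counts))]

print(noisy_transform_release([4, 2, 7, 1], epsilon=1.0, rng=random.Random(0)))
```

The benefit over per-cell noise shows up on long ranges: a range sum depends on only a few coarse coefficients instead of on every cell it covers.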
26 Empirical benchmarks [HMMCZ16]
- An evaluation framework for standardized evaluation of privacy algorithms for range queries (over 1D and 2D data)
- Demo:
- Key findings:
  - The scale (size) and shape of the data significantly affect algorithm error.
  - In a high-signal regime (high scale, high epsilon), simpler data-independent methods such as IDENTITY work well.
  - In a low-signal regime (low scale, low epsilon), data-dependent algorithms should be considered, but they come with no guarantees.
  - While no algorithm universally dominates across settings, DAWA is a competitive choice on most datasets.
27 Programming Assignment and Competition: Laplace mechanism for range queries
- Required:
  - Implement the baseline IDENTITY histogram algorithm
  - Evaluate accuracy for a random set of range queries
- Optional: optimizations and enhancements
- Competition
28 Outline
- Tabular data and histogram/range queries
- Algorithms for low dimensional data
  - Baseline
  - Partitioning algorithms: kd-tree, quad-tree, ...
  - An evaluation framework: DPBench
- Algorithms for high dimensional data
  - Copula functions [LXJ14]
  - Bayesian networks [ZCPSX14]
29 Traditional Approaches
- Parametric methods: fit the data to a distribution and make inferences about its parameters, e.g. PrivacyOnTheMap
- Non-parametric methods: learn the empirical distribution through histograms, e.g. PSD, Privelet, FP, P-HP
[Figure labels: original data, histogram, perturbation, synthetic data]
30 Semi-parametric methods: semi-parametric modeling using Copula functions
Haoran Li, Li Xiong, Xiaoqian Jiang. Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions, EDBT 2014
31
- Gaussian copula: models the dependence, with arbitrary margins
- Gaussian distribution: models the joint distribution
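The distinction can be written out using the standard definition of a Gaussian copula (this is textbook material, not copied from the slide). Here F_j are the arbitrary marginal CDFs, Φ is the standard normal CDF, and Φ_P is the CDF of a zero-mean multivariate normal with correlation matrix P:

```latex
F(x_1, \dots, x_m) = \Phi_P\big(\Phi^{-1}(F_1(x_1)), \dots, \Phi^{-1}(F_m(x_m))\big)
```

Only the dependence structure P is Gaussian; each margin F_j can follow any distribution, which is why the marginals and the correlation matrix can be estimated separately.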
32 [Figure: DPCopula pipeline on a data set with attributes Age, Hours/week, Income]
- Step 1: compute DP marginal histograms from the original data set.
- Step 2: compute the DP correlation matrix through DP MLE (Maximum Likelihood Estimation), giving the DP dependence structure.
- Step 3: sample the DP synthetic data set.
33 Overview
[Figure: a data set with attributes Age, Hours/week, Income, Gender is partitioned by Gender (F, M); DPCopula is applied to each partition, and each partition size n is released as a noisy count ñ = n + Lap(1/ε)]
34 Datasets
- US Census data: 4 attributes, 100,000 records
- Brazil data: 8 attributes, 188,846 records
- Synthetic data
- Comparison: PSD, Privelet+, FP, P-HP
- Metrics: random range-count queries with random query predicates covering all attributes; relative error and absolute error
35 Query accuracy vs. differential privacy budget
36
- Gaussian dependence assumption
- Pair-wise attribute correlation does not scale with high dimensions
- Works well for continuous data or attributes with large domains
37 Outline
- Tabular data and histogram/range queries
- Algorithms for low dimensional data
  - Baseline
  - Partitioning algorithms: kd-tree, quad-tree, ...
  - An evaluation framework: DPBench
- Algorithms for high dimensional data
  - Copula functions [LXJ14]
  - Bayesian networks [ZCPSX14]
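38 [Figure: two routes from the sensitive database D to the synthetic database D']
- Direct: convert D to the full-dimensional tuple distribution, add noise to get a noisy distribution, then sample.
- Approximate: convert D to a set of low-dimensional distributions, add noise to get noisy low-dimensional distributions, then sample.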
39 Bayesian network example
[Figure: Burglary and Earthquake → Alarm → JohnCalls, MaryCalls, with probability tables P(B), P(E), P(A | B, E), P(J | A), P(M | A); table values not preserved]
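40 A 5-dimensional database with attributes age, workclass, education, title, income
[Figure: network with one factor per node: Pr[age], Pr[work | age], Pr[edu | age], Pr[title | work], Pr[income | work]]
41 A 5-dimensional database with attributes age, workclass, education, title, income
$\Pr[\text{age}, \text{work}, \text{edu}, \text{title}, \text{income}] \approx \Pr[\text{age}] \cdot \Pr[\text{work} \mid \text{age}] \cdot \Pr[\text{edu} \mid \text{age}] \cdot \Pr[\text{title} \mid \text{work}] \cdot \Pr[\text{income} \mid \text{work}]$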
42
- STEP 1: Choose a suitable Bayesian network N. This must be done in a differentially private way.
- STEP 2: Compute the conditional distributions implied by N. Straightforward to do under differential privacy: inject noise with the Laplace mechanism.
- STEP 3: Generate synthetic data by sampling from N. This is post-processing: no privacy issues.
43 Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu 68].
- It is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information I of its edges.
- This is equivalent to finding the maximum spanning tree, where the weight of edge (X, Y) is the mutual information I(X, Y).
45 Build a 1-degree BN for a database with attributes A, B, C, D and records Alan, Bob, Cykie, David, Eric, Frank, George, Helen, Ivan, Jack [table values not preserved]
46 Start from a random attribute, A [figure: nodes A, B, C, D]
47 Select the next tree edge by its mutual information. Candidates: A→B, A→C, A→D
48 Select the next tree edge by its mutual information. Candidates: A→B (I = 1), A→C (I = 0.4), A→D (I = 0)
49 Select the next tree edge by its mutual information [figure: edge A→B added]
50 Select the next tree edge by its mutual information. Candidates: A→C, A→D, B→C, B→D (values shown in the figure: I = 0, 0.4, 0.2, 0)
51 Select the next tree edge by its mutual information [figure: final tree over A, B, C, D]. DONE!
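The non-private walk-through above can be sketched as a Prim-style maximum spanning tree over empirical mutual information. The eight records below are made up so that I(A, B) = 1 bit and D is independent of the rest; they are not the table from the slides, and the function names are assumptions.

```python
import math
from collections import Counter

def mutual_information(records, x, y):
    """Empirical mutual information (in bits) between attributes x and y."""
    n = len(records)
    pxy = Counter((r[x], r[y]) for r in records)
    px = Counter(r[x] for r in records)
    py = Counter(r[y] for r in records)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )

def chow_liu_tree(records, attrs):
    """Grow a tree from attrs[0], always adding the candidate edge with the largest I."""
    in_tree = [attrs[0]]
    edges = []
    while len(in_tree) < len(attrs):
        parent, child = max(
            ((p, c) for p in in_tree for c in attrs if c not in in_tree),
            key=lambda e: mutual_information(records, e[0], e[1]),
        )
        edges.append((parent, child))
        in_tree.append(child)
    return edges

# Hypothetical binary data: B copies A, C mostly follows A, D is independent.
cols = {
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1, 1, 1],
    "C": [0, 0, 0, 1, 1, 1, 1, 0],
    "D": [0, 1, 0, 1, 0, 1, 0, 1],
}
records = [{k: v[i] for k, v in cols.items()} for i in range(8)]
edges = chow_liu_tree(records, ["A", "B", "C", "D"])
print(edges[:2])  # [('A', 'B'), ('A', 'C')]
```

Ties (such as the zero-information edges into D here) are broken by iteration order, which is an arbitrary choice of this sketch.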
52 Do it under Differential Privacy!
- (Non-private) select the edge with maximum I
- (Private) I is data-sensitive, so the best edge is also data-sensitive
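53
- Databases D, edges e: define a quality function $q(D, e) \to \mathbb{R}$, i.e. how good edge e is as the result of the selection, given database D.
- Return e with probability
  $\Pr[e] \propto \exp\left(\frac{\varepsilon}{2} \cdot \frac{q(D, e)}{\Delta_q}\right)$
  where $\Delta_q = \max_{D, D', e} \lVert q(D, e) - q(D', e) \rVert_1$.
- [Figure labels on the two terms of the exponent: info, noise]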
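To see what this selection rule does, the sketch below computes the selection probabilities for the three candidate edges from slide 48, using their mutual information as the quality score. The sensitivity value 0.5 is a placeholder assumption (the true sensitivity of mutual information depends on the data size and domain), and `selection_probabilities` is a hypothetical helper.

```python
import math

def selection_probabilities(scores, epsilon, sensitivity):
    """Pr[e] proportional to exp(epsilon * q(D, e) / (2 * sensitivity))."""
    top = max(scores.values())  # shifting by the max leaves the distribution unchanged
    weights = {
        e: math.exp(epsilon * (q - top) / (2.0 * sensitivity))
        for e, q in scores.items()
    }
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()}

scores = {("A", "B"): 1.0, ("A", "C"): 0.4, ("A", "D"): 0.0}
probs = selection_probabilities(scores, epsilon=1.0, sensitivity=0.5)
for edge, p in probs.items():
    print(edge, round(p, 3))
```

As ε shrinks toward 0 the three probabilities approach 1/3 each (pure noise); as ε grows, the mass concentrates on the highest-scoring edge (pure information).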
54
- STEP 1: Choose a suitable Bayesian network N. This must be done in a differentially private way.
- STEP 2: Compute the conditional distributions implied by N. Straightforward to do under differential privacy: inject noise with the Laplace mechanism.
- STEP 3: Generate synthetic data by sampling from N. This is post-processing: no privacy issues.
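Steps 2 and 3 can be sketched for a network that is already given as a list of (child, parent) pairs in topological order. This is a simplified illustration and not the PrivBayes implementation: it perturbs raw (parent, child) counts with Laplace noise of scale d/ε for d tables (assuming sensitivity 1 per table and sequential composition across tables), clamps negatives, and normalizes. All names and the toy data are hypothetical.

```python
import random
from collections import Counter

def laplace(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_conditionals(records, network, domains, epsilon, rng):
    """STEP 2: one noisy conditional table Pr[child | parent] per network entry."""
    per_table = epsilon / len(network)  # split the budget evenly across tables
    tables = {}
    for child, parent in network:
        counts = Counter(
            (r[parent] if parent is not None else None, r[child]) for r in records
        )
        parent_values = domains[parent] if parent is not None else [None]
        table = {}
        for pv in parent_values:
            noisy = [
                max(counts[(pv, cv)] + laplace(1.0 / per_table, rng), 0.0)
                for cv in domains[child]
            ]
            total = sum(noisy)
            k = len(noisy)
            table[pv] = [x / total for x in noisy] if total > 0 else [1.0 / k] * k
        tables[child] = table
    return tables

def sample_records(network, domains, tables, n, rng):
    """STEP 3: ancestral sampling, which is post-processing of the noisy tables."""
    out = []
    for _ in range(n):
        rec = {}
        for child, parent in network:
            pv = rec[parent] if parent is not None else None
            rec[child] = rng.choices(domains[child], weights=tables[child][pv], k=1)[0]
        out.append(rec)
    return out

rng = random.Random(0)
domains = {"A": [0, 1], "B": [0, 1]}
network = [("A", None), ("B", "A")]  # A has no parent, B depends on A
data = [{"A": i % 2, "B": i % 2} for i in range(200)]
tables = noisy_conditionals(data, network, domains, epsilon=1.0, rng=rng)
synthetic = sample_records(network, domains, tables, n=5, rng=rng)
print(synthetic)
```

Only `noisy_conditionals` touches the sensitive data; once the noisy tables exist, any number of synthetic records can be drawn at no further privacy cost.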
55 Outline
- Tabular data and histogram/range queries
- Algorithms for low dimensional data
  - Baseline
  - Partitioning algorithms: kd-tree, quad-tree, ...
  - Transformation: Wavelet, Fourier Transform, ...
- Algorithms for high dimensional data
  - Copula functions
  - Bayesian networks
56 Open questions
- High dimensional data
- Robust and private algorithm selection
- Error bounds for data-dependent algorithms
57 References
[ACC12] Aćs et al. Differentially private histogram publishing through lossy compression. In ICDM.
[BBDS12] Blocki et al. The Johnson-Lindenstrauss transform itself preserves differential privacy. In FOCS.
[BCDKMT07] Barak et al. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS.
[BLR08] Blum et al. A learning theory approach to noninteractive database privacy. In STOC.
[DNRR15] Dwork et al. Pure Differential Privacy for Rectangle Queries via Private Partitions. In ASIACRYPT.
[CPSSY12] Cormode et al. Differentially Private Spatial Decompositions. In ICDE.
[GAHRW14] Gaboardi et al. Dual Query: Practical Private Query Release for High Dimensional Data. In ICML.
[HLM12] Hardt et al. A simple and practical algorithm for differentially private data release. In NIPS.
[HMMCZ16] Hay et al. Principled Evaluation of Differentially Private Algorithms using DPBench. In SIGMOD.
[HRMS10] Hay et al. Boosting the accuracy of differentially private histograms through consistency. In PVLDB.
[LHMY14] Li et al. A data- and workload-aware algorithm for range queries under differential privacy. In PVLDB.
[LHRMM10] Li et al. Optimizing linear counting queries under differential privacy. In PODS.
[LM12] Li et al. An adaptive mechanism for accurate query answering under differential privacy. In PVLDB.
[LM13] Li et al. Optimal error of query sets under the differentially-private matrix mechanism. In ICDT.
[LZWY11] Li et al. Compressive mechanism: utilizing sparse representation in differential privacy. In WPES.
[QYL13] Qardaji et al. Understanding hierarchical methods for differentially private histograms. In PVLDB.
[QYL13] Qardaji et al. Differentially private grids for geospatial data. In ICDE.
[RN10] Rastogi et al. Differentially private aggregation of distributed time-series with transformation and encryption. In SIGMOD.
[WWLTRD09] Wang et al. Privacy-preserving genomic computation through program specialization. In CCS.
[XWG10] Xiao et al. Differential privacy via wavelet transforms. In ICDE.
[ZCPSX14] Zhang et al. PrivBayes: private data release via Bayesian networks. In SIGMOD.
[ZXX16] Zhang et al. PrivTree: A Differentially Private Algorithm for Hierarchical Decompositions. In SIGMOD.
More informationData Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining
Data Mining: Exploring Data Lecture Notes for Data Exploration Chapter Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 What is data exploration?
More informationDetecting Salient Contours Using Orientation Energy Distribution. Part I: Thresholding Based on. Response Distribution
Detecting Salient Contours Using Orientation Energy Distribution The Problem: How Does the Visual System Detect Salient Contours? CPSC 636 Slide12, Spring 212 Yoonsuck Choe Co-work with S. Sarma and H.-C.
More informationClustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY
Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm Clustering is an unsupervised machine learning algorithm that divides a data into meaningful sub-groups,
More informationECLT 5810 Data Preprocessing. Prof. Wai Lam
ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate
More informationNotes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)
1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should
More informationData Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA
Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationCS 229 Midterm Review
CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask
More informationA Review on Cluster Based Approach in Data Mining
A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,
More informationRobust Shape Retrieval Using Maximum Likelihood Theory
Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2
More informationClustering from Data Streams
Clustering from Data Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting
More informationProblem 1: Complexity of Update Rules for Logistic Regression
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1
More informationCS573 Data Privacy and Security. Li Xiong
CS573 Data Privacy and Security Anonymizationmethods Li Xiong Today Clustering based anonymization(cont) Permutation based anonymization Other privacy principles Microaggregation/Clustering Two steps:
More informationIndexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel
Indexing Week 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Overview Conventional indexes B-trees Hashing schemes
More informationBayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis
Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis Xavier Le Faucheur a, Brani Vidakovic b and Allen Tannenbaum a a School of Electrical and Computer Engineering, b Department of Biomedical
More informationNearest Neighbor with KD Trees
Case Study 2: Document Retrieval Finding Similar Documents Using Nearest Neighbors Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Emily Fox January 22 nd, 2013 1 Nearest
More informationMeta-Clustering. Parasaran Raman PhD Candidate School of Computing
Meta-Clustering Parasaran Raman PhD Candidate School of Computing What is Clustering? Goal: Group similar items together Unsupervised No labeling effort Popular choice for large-scale exploratory data
More informationMSA220 - Statistical Learning for Big Data
MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups
More informationImage Analysis & Retrieval. CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W Lec 18.
Image Analysis & Retrieval CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W 4-5:15pm@Bloch 0012 Lec 18 Image Hashing Zhu Li Dept of CSEE, UMKC Office: FH560E, Email: lizhu@umkc.edu, Ph:
More informationELEC Dr Reji Mathew Electrical Engineering UNSW
ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion
More informationEE795: Computer Vision and Intelligent Systems
EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and
More informationData Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation
Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization
More informationObject Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision
Object Recognition Using Pictorial Structures Daniel Huttenlocher Computer Science Department Joint work with Pedro Felzenszwalb, MIT AI Lab In This Talk Object recognition in computer vision Brief definition
More informationOutline. The History of Histograms. Yannis Ioannidis University of Athens, Hellas
The History of Histograms Yannis Ioannidis University of Athens, Hellas Outline Prehistory Definitions and Framework The Early Past 10 Years Ago The Recent Past Industry Competitors The Future Prehistory
More informationMax-Count Aggregation Estimation for Moving Points
Max-Count Aggregation Estimation for Moving Points Yi Chen Peter Revesz Dept. of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588, USA Abstract Many interesting problems
More informationpcube: Update-Efficient Online Aggregation with Progressive Feedback and Error Bounds
pcube: Update-Efficient Online Aggregation with Progressive Feedback and Error Bounds Mirek Riedewald, Divyakant Agrawal, and Amr El Abbadi Department of Computer Science University of California, Santa
More informationQuadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase
Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Bumjoon Jo and Sungwon Jung (&) Department of Computer Science and Engineering, Sogang University, 35 Baekbeom-ro, Mapo-gu, Seoul 04107,
More informationApplied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University
Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University NIPS 2008: E. Sudderth & M. Jordan, Shared Segmentation of Natural
More informationHidden Markov Models. Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi
Hidden Markov Models Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi Sequential Data Time-series: Stock market, weather, speech, video Ordered: Text, genes Sequential
More informationMethods for Intelligent Systems
Methods for Intelligent Systems Lecture Notes on Clustering (II) Davide Eynard eynard@elet.polimi.it Department of Electronics and Information Politecnico di Milano Davide Eynard - Lecture Notes on Clustering
More informationChapter 5 Efficient Memory Information Retrieval
Chapter 5 Efficient Memory Information Retrieval In this chapter, we will talk about two topics: (1) What is a kd-tree? (2) How can we use kdtrees to speed up the memory-based learning algorithms? Since
More informationIntroduction to Data Mining
Introduction to Data Mining Privacy preserving data mining Li Xiong Slides credits: Chris Clifton Agrawal and Srikant 4/3/2011 1 Privacy Preserving Data Mining Privacy concerns about personal data AOL
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationObject Classification Problem
HIERARCHICAL OBJECT CATEGORIZATION" Gregory Griffin and Pietro Perona. Learning and Using Taxonomies For Fast Visual Categorization. CVPR 2008 Marcin Marszalek and Cordelia Schmid. Constructing Category
More informationOrganizing Spatial Data
Organizing Spatial Data Spatial data records include a sense of location as an attribute. Typically location is represented by coordinate data (in 2D or 3D). 1 If we are to search spatial data using the
More informationPattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition
Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant
More information