CS573 Data Privacy and Security. Differential Privacy tabular data and range queries. Li Xiong


1 CS573 Data Privacy and Security Differential Privacy tabular data and range queries Li Xiong

2 Outline: tabular data and histogram/range queries; algorithms for low-dimensional data; algorithms for high-dimensional data.

3 Example: cohort discovery from medical records. Histograms. Cohort discovery as range queries: SELECT COUNT(*) FROM D WHERE A1 IN I1 AND A2 IN I2 AND ... AND Am IN Im.

4 Example: statistical agencies and data publishing. A marginal over attributes A_1, ..., A_k reports the count for each combination of attribute values (also known as a cube or contingency table), e.g. a 2-way marginal on EmploymentStatus and Gender. U.S. Census Bureau statistics can typically be derived from k-way marginals over different combinations of the available attributes; hundreds of marginals are released. Module 3 Tutorial: Differential Privacy in the Wild
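As a small illustration of what a marginal reports, the sketch below counts records per combination of attribute values. It is non-private (no noise is added), and the records and the helper name `marginal` are made up for the example.

```python
from collections import Counter

def marginal(records, attrs):
    """Count records for each combination of values of `attrs`."""
    return Counter(tuple(r[a] for a in attrs) for r in records)

# Hypothetical toy records; only the attribute names come from the slide.
records = [
    {"EmploymentStatus": "employed", "Gender": "F", "Age": 34},
    {"EmploymentStatus": "employed", "Gender": "M", "Age": 51},
    {"EmploymentStatus": "unemployed", "Gender": "F", "Age": 27},
    {"EmploymentStatus": "employed", "Gender": "F", "Age": 45},
]

# 2-way marginal on EmploymentStatus and Gender.
two_way = marginal(records, ["EmploymentStatus", "Gender"])
```

Each key of `two_way` is one cell of the contingency table; combinations that never occur are simply absent and count as zero.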

5 Example: range queries over spatial data. Input: sensitive data D and a range query workload W (shown: a workload of 3 range queries over a scatter plot of the input data). Task: compute answers to workload W over the private input D. BeijingTaxi dataset [1]: 4,268,780 records of (lat, lon) pairs of taxi pickup locations in Beijing, China, in 1 month. [1] Raw data from: Taxi trajectory open dataset, Tsinghua University, China.

6 Problem variant: offline vs. online. Offline (batch): the entire W is given as input and answers are computed in batch. Online (adaptive): W is a sequence q_1, q_2, ... that arrives online. Adaptive: the analyst's choice of q_i can depend on the answers a_1, ..., a_{i-1}.

7 Important aspects of the problem: data and query complexity. Data complexity: dimensionality (number of attributes) and domain size (number of distinct attribute combinations); many techniques are specialized for low-dimensional data. Query complexity: given query workload vs. no query workload; classes of queries: histograms, count queries, linear queries (sum, average), median.

8 Solution variants: query answers vs. synthetic data. Two high-level approaches to solving the problem: 1. Direct: the output of the algorithm is a list of query answers. 2. Synthetic data: the algorithm constructs a synthetic dataset D', which the analyst can query directly; the analyst can pose additional queries on D' (though the answers may not be accurate).

9 Synthetic data: categories of methods. Nonparametric methods release empirical distributions, i.e. histograms, with differential privacy. Parametric and semi-parametric methods learn the parameters of a distribution with differential privacy.

10 Outline: tabular data and histogram/range queries; algorithms for low-dimensional data (baseline; partitioning algorithms: kd-tree, quadtree, ...; transformations: wavelet, Fourier transform, ...; an evaluation framework: DPBench); algorithms for high-dimensional data.

11 Baseline algorithm: IDENTITY. (Figure: scatter plot of input data.) 1. Discretize the attribute domain into cells. 2. Add noise to the cell counts (Laplace mechanism), giving a noisy unit histogram. 3. Use the noisy counts to either (a) answer queries directly (assuming the distribution is uniform within each cell) or (b) generate synthetic data (derive a distribution from the counts and sample).
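The three steps can be sketched for a single numeric attribute as follows. This is a minimal illustration rather than a reference implementation: the function names, the domain [0, 100], the 20 cells, and epsilon = 1.0 are all assumptions, values are assumed to lie inside the domain, and neighboring datasets are taken to differ by adding or removing one record (so each cell count has sensitivity 1).

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) noise as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def identity_histogram(values, lo, hi, n_cells, epsilon, rng):
    """Steps 1-2: equal-width cells over [lo, hi], then Laplace noise per cell."""
    width = (hi - lo) / n_cells
    counts = [0] * n_cells
    for v in values:
        idx = min(int((v - lo) / width), n_cells - 1)
        counts[idx] += 1
    # One record changes exactly one cell by 1, so the scale is 1 / epsilon.
    return [c + laplace(1.0 / epsilon, rng) for c in counts]

def range_count(noisy, lo, hi, a, b):
    """Step 3a: estimate the count in [a, b), assuming uniformity within cells."""
    width = (hi - lo) / len(noisy)
    total = 0.0
    for i, c in enumerate(noisy):
        cell_lo = lo + i * width
        cell_hi = cell_lo + width
        overlap = max(0.0, min(b, cell_hi) - max(a, cell_lo))
        total += c * overlap / width
    return total

rng = random.Random(0)
data = [rng.uniform(0, 100) for _ in range(10_000)]
noisy = identity_histogram(data, 0, 100, 20, epsilon=1.0, rng=rng)
estimate = range_count(noisy, 0, 100, 10, 35)
```

Because the query [10, 35) lines up with cell boundaries here, the estimate is the true count plus the sum of five independent noise draws.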

12 Baseline algorithm: IDENTITY, limitations. Granularity of discretization: with coarse cells detail is lost; with fine cells noise overwhelms the signal. Noise accumulates: the squared error of a range query grows linearly with the size of the range.
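The linear growth of the squared error can be checked with a short calculation, assuming each cell receives independent Lap(1/ε) noise and the range query q covers exactly k cells with true counts c_1, ..., c_k (the symbols k, c_i, and η_i are introduced here for illustration):

```latex
\tilde{q} = \sum_{i=1}^{k} (c_i + \eta_i), \qquad \eta_i \sim \mathrm{Lap}(1/\varepsilon) \ \text{i.i.d.}

\mathbb{E}\big[(\tilde{q} - q)^2\big]
  = \sum_{i=1}^{k} \mathrm{Var}(\eta_i)
  = k \cdot \frac{2}{\varepsilon^{2}}
```

The expected squared error is therefore proportional to k, the number of cells in the range.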

13 [HMMCZ16] Empirical benchmarks An evaluation framework for standardized evaluation of privacy algorithms for range queries (over 1 and 2D) Demo:

14 Data-dependent partitioning. Domain-based (data-independent) partitioning does not work very well: equi-width buckets (equal bucket range) rely on a uniformity assumption. Data-driven partitioning: V-optimal, the partitioning with the least frequency variance; intuition: highest uniformity within each bucket. How can this be done with differential privacy? October 2,

15 Histograms (review). Divide the data into buckets and store the average (or sum) for each bucket. Partitioning rules: equi-width (equal bucket range); equi-depth (equal frequency); V-optimal (least frequency variance).

16 An early attempt: DPCube [SDM 2010, ICDE 2012 demo].
Original records (example):
Name   Age  Income  HIV+
Frank  42   30K     Y
Bob    31   60K     Y
Mary   28   20K     Y
Pipeline: original records -> DP interface -> DP unit histogram (ε/2-DP) -> multi-dimensional partitioning -> DP V-optimal histogram (ε/2-DP).
1. Compute the unit histogram with differential privacy. 2. Apply kd-tree partitioning. 3. Compute the merged bin counts with differential privacy.

17 kd-tree based partitioning. Choose the dimension and the splitting point for each split (to minimize variance). Repeat until the count of the node is less than a threshold, or the variance or entropy of the node is less than a threshold.
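A minimal sketch of this recursion on a 2D grid of cell counts is shown below. It uses the sum of squared deviations from the block mean as the variance measure and tries every axis-aligned cut; the function names and thresholds are assumptions. In a DPCube-style pipeline the grid would hold noisy counts, so the partitioning itself is post-processing.

```python
def sse(grid, r0, r1, c0, c1):
    """Sum of squared deviations of cell counts from their mean in a block."""
    cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    mean = sum(cells) / len(cells)
    return sum((x - mean) ** 2 for x in cells)

def kd_partition(grid, r0, r1, c0, c1, min_count, min_sse):
    """Return blocks (r0, r1, c0, c1), half-open, with near-uniform counts."""
    total = sum(grid[r][c] for r in range(r0, r1) for c in range(c0, c1))
    if total < min_count or sse(grid, r0, r1, c0, c1) < min_sse:
        return [(r0, r1, c0, c1)]
    best = None
    for r in range(r0 + 1, r1):  # candidate cuts between rows
        cost = sse(grid, r0, r, c0, c1) + sse(grid, r, r1, c0, c1)
        if best is None or cost < best[0]:
            best = (cost, (r0, r, c0, c1), (r, r1, c0, c1))
    for c in range(c0 + 1, c1):  # candidate cuts between columns
        cost = sse(grid, r0, r1, c0, c) + sse(grid, r0, r1, c, c1)
        if best is None or cost < best[0]:
            best = (cost, (r0, r1, c0, c), (r0, r1, c, c1))
    if best is None:  # a single cell cannot be split further
        return [(r0, r1, c0, c1)]
    _, first, second = best
    return (kd_partition(grid, *first, min_count, min_sse)
            + kd_partition(grid, *second, min_count, min_sse))

grid = [
    [10, 10, 0, 0],
    [10, 10, 0, 0],
    [10, 10, 0, 0],
    [10, 10, 0, 0],
]
blocks = kd_partition(grid, 0, 4, 0, 4, min_count=1, min_sse=1e-9)
```

On this toy grid the best cut separates the dense left half from the empty right half, and both halves are then uniform enough to stop.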

18 DPCube [SecureDM 2010, ICDE 2012 demo], limitations. The DP unit histogram is very noisy, which affects the accuracy of the partitioning. Sequential composition: the two ε/2-DP steps (the DP unit histogram, and the DP V-optimal histogram after multi-dimensional partitioning) share the privacy budget.

19 A later improvement: private spatial decompositions [CPSSY12]. Approach: (top-down) partitioning with differential privacy, using a quadtree or a hybrid/kd-tree. (Figure panels: quadtree, kd-tree.)

20 Building a private kd-tree. Input: maximum height h, minimum leaf size L, data set. Process: choose a dimension to split; get the (private) median in this dimension; create child nodes and add noise to their counts; recurse until the maximum height is reached, the noisy count of the node is less than L, or the budget along the root-leaf path has been used up. The entire PSD satisfies DP by the composition property.

21 Building a private kd-tree. Input: maximum height h, minimum leaf size L, data set. Process: choose a dimension to split; get the (private) median in this dimension using the exponential mechanism with utility function u(x) = -|rank(x) - rank(median)|; create child nodes and add noise to their counts; recurse until the maximum height is reached, the noisy count of the node is less than L, or the budget along the root-leaf path has been used up. The entire PSD satisfies DP by the composition property.
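The private median step can be sketched as follows. The sketch assumes a fixed, data-independent grid of candidate split points, takes rank(x) to be the number of data values at most x, and uses a conservative utility sensitivity of 1; the function name and the toy data are illustrative.

```python
import bisect
import math
import random

def private_median(values, candidates, epsilon, rng):
    """Pick a candidate split point via the exponential mechanism."""
    data = sorted(values)
    target = len(data) / 2.0  # rank of the true median
    # Utility: negated rank distance between the candidate and the median.
    scores = [-abs(bisect.bisect_right(data, x) - target) for x in candidates]
    top = max(scores)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp(epsilon * (s - top) / 2.0) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(1)
values = [rng.gauss(50, 10) for _ in range(1000)]
candidates = list(range(0, 101))  # hypothetical integer grid over the domain
split = private_median(values, candidates, epsilon=1.0, rng=rng)
```

Candidates whose rank is far from n/2 receive exponentially small weight, so with dense data the chosen split lands close to the true median.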

22 Building private spatial decompositions: privacy budget allocation. The budget is split between medians and counts at each node, trading off the accuracy of the division against the accuracy of the counts. The budget is also split across the levels of the tree: the privacy budget used along any root-leaf path should total ε (sequential composition along a path, parallel composition across disjoint nodes at the same level). Optimal budget allocation. Post-processing with a consistency check.
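A minimal sketch of per-level budget allocation follows. The uniform split is the obvious baseline; the geometric split, which gives deeper levels a larger share, uses a growth ratio of 2^(1/3) as an assumption modeled on the geometric allocation discussed in [CPSSY12]. The 70/30 split between counts and medians is an arbitrary placeholder, and all function names are hypothetical.

```python
def uniform_budget(epsilon, height):
    """Equal share for each of the height + 1 levels (root = level 0)."""
    return [epsilon / (height + 1)] * (height + 1)

def geometric_budget(epsilon, height, ratio=2 ** (1 / 3)):
    """Shares grow by `ratio` per level from root to leaves and sum to epsilon."""
    weights = [ratio ** level for level in range(height + 1)]
    total = sum(weights)
    return [epsilon * w / total for w in weights]

def split_node_budget(level_epsilon, count_fraction=0.7):
    """Split one level's budget between the noisy count and the private median."""
    return level_epsilon * count_fraction, level_epsilon * (1 - count_fraction)

per_level = geometric_budget(1.0, height=4)
count_eps, median_eps = split_node_budget(per_level[0])
```

Either way, the shares along one root-leaf path add up to ε, which is what sequential composition requires.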

23 Data-dependent partitioning. Heuristics-based methods: kd-tree, quadtree. Optimal methods: V-optimal histogram (1D or 2D).

24 Data-aware/Workload-Aware Mechanism [LHMY14] Step 1: dynamic programming based methods for optimal partitioning Step 2: matrix mechanism for optimal noise given a query workload

25 Data transformations. Trees can be thought of as a data-dependent transform of the input; other data transformations can be applied as well. General idea: apply a transform to the data; add noise in the transformed space (based on sensitivity); publish the noisy coefficients, or invert the transform (post-processing). Goal: pick a transform that preserves good properties of the data and has low sensitivity, so that the noise does not corrupt it. Pipeline: original data -> transform -> coefficients -> noise -> noisy coefficients -> invert -> private data.
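The transform, perturb, invert pipeline can be sketched with an unnormalized Haar wavelet transform on a 1D histogram. This is an illustrative assumption, not the weighted-noise scheme of [XWG10]: every coefficient gets the same Laplace scale, derived from the L1 sensitivity of the transform under adding or removing one record. All function names are hypothetical, and the histogram length is assumed to be a power of two.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) noise as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def haar_matrix(n):
    """Unnormalized Haar matrix for n a power of two (rows are orthogonal)."""
    if n == 1:
        return [[1.0]]
    half = haar_matrix(n // 2)
    averages = [[v for v in row for _ in (0, 1)] for row in half]
    details = [[0.0] * n for _ in range(n // 2)]
    for i in range(n // 2):
        details[i][2 * i], details[i][2 * i + 1] = 1.0, -1.0
    return averages + details

def forward(H, x):
    return [sum(h * v for h, v in zip(row, x)) for row in H]

def inverse(H, y):
    """Invert using row orthogonality: x = H^T diag(1 / ||row||^2) y."""
    n = len(H)
    norms = [sum(h * h for h in row) for row in H]
    return [sum(H[i][j] * y[i] / norms[i] for i in range(n)) for j in range(n)]

def l1_sensitivity(H):
    """One record changes one cell by 1, i.e. adds one column of H."""
    n = len(H)
    return max(sum(abs(H[i][j]) for i in range(n)) for j in range(n))

def private_release(counts, epsilon, rng):
    H = haar_matrix(len(counts))
    scale = l1_sensitivity(H) / epsilon
    noisy = [y + laplace(scale, rng) for y in forward(H, counts)]
    return inverse(H, noisy)  # inversion is post-processing

counts = [12, 15, 9, 30, 28, 31, 2, 0]
released = private_release(counts, epsilon=1.0, rng=random.Random(0))
```

For a histogram of n cells each column of this matrix has 1 + log2(n) nonzero entries of magnitude 1, so the sensitivity grows only logarithmically with the domain size.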

26 [HMMCZ16] Empirical benchmarks. An evaluation framework for standardized evaluation of privacy algorithms for range queries (over 1D and 2D data). Key findings: the scale/size and shape of the data significantly affect algorithm error. In a high-signal regime (high scale, high epsilon), simpler data-independent methods such as IDENTITY work well. In a low-signal regime (low scale, low epsilon), data-dependent algorithms should be considered, but they come with no guarantees. While no algorithm universally dominates across settings, DAWA is a competitive choice on most datasets.

27 Programming assignment and competition: Laplace mechanism for range queries. Required: implement the baseline IDENTITY histogram algorithm and evaluate its accuracy on a random set of range queries. Optional: optimizations and enhancements. Competition.

28 Outline: tabular data and histogram/range queries; algorithms for low-dimensional data (baseline; partitioning algorithms: kd-tree, quadtree, ...; an evaluation framework: DPBench); algorithms for high-dimensional data (copula functions [LXJ14]; Bayesian networks [ZCPSX14]).

29 Traditional approaches. Parametric methods: fit the data to a distribution and make inferences about its parameters, e.g. PrivacyOnTheMap. Non-parametric methods: learn the empirical distribution through histograms, e.g. PSD, Privelet, FP, P-HP. (Diagram: original data -> histogram -> perturbation -> synthetic data.)

30 Semi-parametric methods: semi-parametric modeling using copula functions [LXJ14]. Haoran Li, Li Xiong, Xiaoqian Jiang. Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions, EDBT 2014.

31 Gaussian copula: models the dependence, with arbitrary margins. Gaussian distribution: models the joint distribution.

32 DPCopula steps (figure; example attributes: Age, Hours/week, Income). Step 1: compute DP marginal histograms from the original data set. Step 2: compute the DP correlation matrix P~ through DP MLE (maximum likelihood estimation), which gives the DP dependence structure. Step 3: sample the DP synthetic data set.
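Step 3 alone can be sketched for two attributes as below. The sketch assumes the marginal histograms and the correlation coefficient `rho` have already been produced under differential privacy (with noisy counts clipped to be non-negative), so sampling is pure post-processing; the histograms, `rho`, and function names are made-up illustrations, and the DP estimation of the correlation matrix is not shown.

```python
import math
import random

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inverse_cdf(hist, u):
    """Map u in [0, 1) to a bin index using the histogram's cumulative mass."""
    total = sum(hist)
    acc = 0.0
    for idx, mass in enumerate(hist):
        acc += mass / total
        if u < acc:
            return idx
    return len(hist) - 1

def sample_copula(hist_x, hist_y, rho, n, rng):
    """Draw n pairs of bin indices from a bivariate Gaussian copula."""
    out = []
    for _ in range(n):
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = g1
        z2 = rho * g1 + math.sqrt(1 - rho * rho) * g2  # corr(z1, z2) = rho
        # The normal CDF maps each coordinate to a uniform; the inverse
        # marginal CDF then maps it back into the attribute's domain.
        out.append((inverse_cdf(hist_x, normal_cdf(z1)),
                    inverse_cdf(hist_y, normal_cdf(z2))))
    return out

hist_x = [5, 20, 50, 20, 5]   # hypothetical (already noisy) marginal counts
hist_y = [10, 30, 40, 20]
pairs = sample_copula(hist_x, hist_y, rho=0.8, n=5000, rng=random.Random(3))
```

The margins of the sampled pairs follow the two histograms, while the dependence between them comes only from the Gaussian correlation.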

33 Overview (figure). The data set (Age, Hours/week, Income, Gender) is partitioned by Gender (Gender = F, Gender = M), and DPCopula is applied to each partition over (Age, Hours/week, Income). Each partition size n is perturbed as ñ = n + Lap(1/ε).

34 Datasets. US Census data: 4 attributes, 100,000 records. Brazil data: 8 attributes, 188,846 records. Synthetic data. Comparison: PSD, Privelet+, FP, P-HP. Metrics: random range-count queries with random query predicates covering all attributes, evaluated by relative error and absolute error.

35 Query accuracy vs. differential privacy budget

36 Gaussian dependence assumption. Pairwise attribute correlation does not scale to high dimensions. Works well for continuous data or attributes with large domains.

37 Outline: tabular data and histogram/range queries; algorithms for low-dimensional data (baseline; partitioning algorithms: kd-tree, quadtree, ...; an evaluation framework: DPBench); algorithms for high-dimensional data (copula functions [LXJ14]; Bayesian networks [ZCPSX14]).

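38 (Diagram.) Direct route: sensitive database D -> convert -> full-dimensional tuple distribution -> + noise -> noisy distribution -> sample -> synthetic database D'. Low-dimensional route: approximate the full-dimensional distribution by a set of low-dimensional distributions -> + noise -> noisy low-dimensional distributions -> sample -> synthetic database D'.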

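39 Bayesian network example (figure; table values not shown). Nodes: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M). Edges: Burglary -> Alarm, Earthquake -> Alarm, Alarm -> JohnCalls, Alarm -> MaryCalls. Probability tables: P(B), P(E), P(A | B, E), P(J | A), P(M | A).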

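40 A 5-dimensional database with attributes age, workclass, education, title, income. Factors of the network: $\Pr[\text{age}]$, $\Pr[\text{work} \mid \text{age}]$, $\Pr[\text{edu} \mid \text{age}]$, $\Pr[\text{title} \mid \text{work}]$, $\Pr[\text{income} \mid \text{work}]$.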

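41 A 5-dimensional database with attributes age, workclass, income, education, title: $\Pr[\text{age}, \text{work}, \text{edu}, \text{title}, \text{income}] \approx \Pr[\text{age}] \cdot \Pr[\text{work} \mid \text{age}] \cdot \Pr[\text{edu} \mid \text{age}] \cdot \Pr[\text{title} \mid \text{work}] \cdot \Pr[\text{income} \mid \text{work}]$.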

42 STEP 1: Choose a suitable Bayesian network N; this must be done in a differentially private way. STEP 2: Compute the conditional distributions implied by N; this is straightforward under differential privacy: inject noise with the Laplace mechanism. STEP 3: Generate synthetic data by sampling from N; this is post-processing, so it raises no privacy issues.

43 Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu 68]. The network is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information I over its edges. It is found by computing the maximum spanning tree, where the weight of edge (X, Y) is the mutual information I(X, Y).

44

45 Build a 1-degree BN for a database with attributes A, B, C, D and one record each for Alan, Bob, Cykie, David, Eric, Frank, George, Helen, Ivan, Jack (table values not shown).

46 Start from a random attribute, A. (Nodes: A, B, C, D.)

47 Select the next tree edge by its mutual information. Candidates: A -> B, A -> C, A -> D.

48 Select the next tree edge by its mutual information. Candidates: A -> B, A -> C, A -> D; the edge labels in the figure are I = 1, I = 0.4, and I = 0.

49 Select the next tree edge by its mutual information. (Nodes: A, B, C, D.)

50 Select the next tree edge by its mutual information. Candidates: A -> C, A -> D, B -> C, B -> D; the edge labels in the figure are I = 0, I = 0.4, I = 0.2, and I = 0.

51 Select the next tree edge by its mutual information. DONE! (Nodes: A, B, C, D.)
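The greedy walk-through above can be sketched as Prim's algorithm on mutual-information weights. This is the non-private version; the toy rows are invented for the example (they are not the table from the slides), and the function names are assumptions.

```python
import math
from collections import Counter

def mutual_information(rows, x, y):
    """Empirical mutual information (in bits) between attributes x and y."""
    n = len(rows)
    pxy = Counter((r[x], r[y]) for r in rows)
    px = Counter(r[x] for r in rows)
    py = Counter(r[y] for r in rows)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def chow_liu_tree(rows, attrs, root):
    """Grow a maximum spanning tree from `root`, one best edge at a time."""
    in_tree, edges = [root], []
    while len(in_tree) < len(attrs):
        candidates = [(mutual_information(rows, u, v), u, v)
                      for u in in_tree for v in attrs if v not in in_tree]
        _, u, v = max(candidates, key=lambda t: t[0])
        edges.append((u, v))  # directed edge parent -> child
        in_tree.append(v)
    return edges

# Hypothetical binary data: B copies A, C mostly follows A, D is unrelated.
columns = {
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1, 1, 1],
    "C": [0, 0, 0, 1, 1, 1, 1, 0],
    "D": [0, 1, 0, 1, 0, 1, 0, 1],
}
rows = [{k: v[i] for k, v in columns.items()} for i in range(8)]
edges = chow_liu_tree(rows, ["A", "B", "C", "D"], root="A")
```

Every attribute other than the root ends up with exactly one parent, which is what makes the result a 1-degree Bayesian network.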

52 Do it under differential privacy! Non-private: select the edge with maximum I. Private: I is data-sensitive, so the best edge is also data-sensitive.

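53 Private selection (the exponential mechanism). Given databases $D$ and candidate edges $e$, define a quality function $q(D, e) \in \mathbb{R}$: how good edge $e$ is as the result of the selection, given database $D$. Return $e$ with probability $\Pr[e] \propto \exp\left(\frac{\varepsilon}{2} \cdot \frac{q(D, e)}{\Delta_q}\right)$, where $q(D, e)$ is the information term and $\Delta_q$ the noise term, with $\Delta_q = \max_{D, D', e} \lVert q(D, e) - q(D', e) \rVert_1$ over neighboring databases $D, D'$.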
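The selection rule can be sketched directly from the formula. The candidate edges and their scores below are illustrative, and the sensitivity value 0.1 is a placeholder assumption; a real use needs a proven bound on how much the quality function can change between neighboring databases.

```python
import math
import random

def exponential_mechanism(candidates, scores, epsilon, sensitivity, rng):
    """Sample a candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    top = max(scores)  # shifting by the max does not change the distribution
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity))
               for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    choice = rng.choices(candidates, weights=probs, k=1)[0]
    return choice, probs

edges = [("A", "B"), ("A", "C"), ("A", "D")]
scores = [1.0, 0.4, 0.0]  # hypothetical mutual-information scores
choice, probs = exponential_mechanism(
    edges, scores, epsilon=1.0, sensitivity=0.1, rng=random.Random(0))
```

A smaller ε or a larger sensitivity flattens `probs` toward uniform, which is the privacy cost showing up as a noisier choice of edge.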

54 STEP 1: Choose a suitable Bayesian network N; this must be done in a differentially private way. STEP 2: Compute the conditional distributions implied by N; this is straightforward under differential privacy: inject noise with the Laplace mechanism. STEP 3: Generate synthetic data by sampling from N; this is post-processing, so it raises no privacy issues.
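Steps 2 and 3 can be sketched for a tree-shaped network as follows. This is a simplified illustration, not the exact procedure of [ZCPSX14]: it adds Laplace noise to raw (parent, child) counts, splits ε evenly across the tables, clips negative counts, and normalizes. The network, data, and function names are assumptions, and the network is assumed to list parents before their children.

```python
import random
from collections import Counter

def laplace(scale, rng):
    """Laplace(0, scale) noise as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_conditionals(rows, network, domains, epsilon, rng):
    """Step 2: noisy Pr[child | parent] for each (child, parent) pair."""
    eps_each = epsilon / len(network)  # sequential composition across tables
    tables = {}
    for child, parent in network:
        counts = Counter((r[parent] if parent else None, r[child]) for r in rows)
        table = {}
        for p in (domains[parent] if parent else [None]):
            noisy = [max(counts[(p, c)] + laplace(1.0 / eps_each, rng), 0.0)
                     for c in domains[child]]
            total = sum(noisy)
            probs = ([x / total for x in noisy] if total > 0
                     else [1.0 / len(noisy)] * len(noisy))
            table[p] = dict(zip(domains[child], probs))
        tables[child] = table
    return tables

def sample_record(tables, network, rng):
    """Step 3: ancestral sampling, parents before children."""
    record = {}
    for child, parent in network:
        dist = tables[child][record[parent] if parent else None]
        values = list(dist)
        record[child] = rng.choices(
            values, weights=[dist[v] for v in values], k=1)[0]
    return record

rng = random.Random(7)
rows = []
for _ in range(200):
    a = rng.randint(0, 1)
    rows.append({"A": a, "B": a if rng.random() < 0.9 else 1 - a})
network = [("A", None), ("B", "A")]  # hypothetical network: A -> B
domains = {"A": [0, 1], "B": [0, 1]}
tables = noisy_conditionals(rows, network, domains, epsilon=1.0, rng=rng)
synthetic = [sample_record(tables, network, rng) for _ in range(1000)]
```

Only `noisy_conditionals` touches the sensitive rows; `sample_record` reads the noisy tables alone, which is why sampling counts as post-processing.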

55 Outline: tabular data and histogram/range queries; algorithms for low-dimensional data (baseline; partitioning algorithms: kd-tree, quadtree, ...; transformations: wavelet, Fourier transform, ...); algorithms for high-dimensional data (copula functions; Bayesian networks).

56 Open questions: high-dimensional data; robust and private algorithm selection; error bounds for data-dependent algorithms.

57 References
[ACC12] Ács et al. Differentially private histogram publishing through lossy compression. In ICDM.
[BBDS12] Blocki et al. The Johnson-Lindenstrauss transform itself preserves differential privacy. In FOCS.
[BCDKMT07] Barak et al. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS.
[BLR08] Blum et al. A learning theory approach to noninteractive database privacy. In STOC.
[DNRR15] Dwork et al. Pure Differential Privacy for Rectangle Queries via Private Partitions. In ASIACRYPT.
[CPSSY12] Cormode et al. Differentially Private Spatial Decompositions. In ICDE.
[GAHRW14] Gaboardi et al. Dual Query: Practical Private Query Release for High Dimensional Data. In ICML.
[HLM12] Hardt et al. A simple and practical algorithm for differentially private data release. In NIPS.
[HMMCZ16] Hay et al. Principled Evaluation of Differentially Private Algorithms using DPBench. In SIGMOD.
[HRMS10] Hay et al. Boosting the accuracy of differentially private histograms through consistency. In PVLDB.
[LHMY14] Li et al. A data- and workload-aware algorithm for range queries under differential privacy. In PVLDB.
[LHRMM10] Li et al. Optimizing linear counting queries under differential privacy. In PODS.
[LM12] Li et al. An adaptive mechanism for accurate query answering under differential privacy. In PVLDB.
[LM13] Li et al. Optimal error of query sets under the differentially-private matrix mechanism. In ICDT.
[LZWY11] Li et al. Compressive mechanism: utilizing sparse representation in differential privacy. In WPES.
[QYL13] Qardaji et al. Understanding hierarchical methods for differentially private histograms. In PVLDB.
[QYL13] Qardaji et al. Differentially private grids for geospatial data. In ICDE.
[RN10] Rastogi et al. Differentially private aggregation of distributed time-series with transformation and encryption. In SIGMOD.
[WWLTRD09] Wang et al. Privacy-preserving genomic computation through program specialization. In CCS.
[XWG10] Xiao et al. Differential privacy via wavelet transforms. In ICDE.
[ZCPSX14] Zhang et al. PrivBayes: private data release via Bayesian networks. In SIGMOD.
[ZXX16] Zhang et al. PrivTree: A Differentially Private Algorithm for Hierarchical Decompositions. In SIGMOD.


More information

Pythia: Data Dependent Differentially Private Algorithm Selection

Pythia: Data Dependent Differentially Private Algorithm Selection : Data Dependent Differentially Private Algorithm Selection ABSTRACT Ios Kotsogiannis Duke University iosk@cs.duke.edu Michael Hay Colgate University mhay@colgate.edu Differential privacy has emerged as

More information

Private Database Synthesis for Outsourced System Evaluation

Private Database Synthesis for Outsourced System Evaluation Private Database Synthesis for Outsourced System Evaluation Vani Gupta 1, Gerome Miklau 1, and Neoklis Polyzotis 2 1 Dept. of Computer Science, University of Massachusetts, Amherst, MA, USA 2 Dept. of

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

By Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad

By Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad By Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad Data Analytics life cycle Discovery Data preparation Preprocessing requirements data cleaning, data integration, data reduction, data

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

More information

arxiv: v1 [cs.ds] 12 Sep 2016

arxiv: v1 [cs.ds] 12 Sep 2016 Jaewoo Lee Penn State University, University Par, PA 16801 Daniel Kifer Penn State University, University Par, PA 16801 JLEE@CSE.PSU.EDU DKIFER@CSE.PSU.EDU arxiv:1609.03251v1 [cs.ds] 12 Sep 2016 Abstract

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Building Classifiers using Bayesian Networks

Building Classifiers using Bayesian Networks Building Classifiers using Bayesian Networks Nir Friedman and Moises Goldszmidt 1997 Presented by Brian Collins and Lukas Seitlinger Paper Summary The Naive Bayes classifier has reasonable performance

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.

More information

DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines

DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines DATA MINING LECTURE 10B Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines NEAREST NEIGHBOR CLASSIFICATION 10 10 Illustrating Classification Task Tid Attrib1

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Data Exploration Chapter Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 What is data exploration?

More information

Detecting Salient Contours Using Orientation Energy Distribution. Part I: Thresholding Based on. Response Distribution

Detecting Salient Contours Using Orientation Energy Distribution. Part I: Thresholding Based on. Response Distribution Detecting Salient Contours Using Orientation Energy Distribution The Problem: How Does the Visual System Detect Salient Contours? CPSC 636 Slide12, Spring 212 Yoonsuck Choe Co-work with S. Sarma and H.-C.

More information

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm Clustering is an unsupervised machine learning algorithm that divides a data into meaningful sub-groups,

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

A Review on Cluster Based Approach in Data Mining

A Review on Cluster Based Approach in Data Mining A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

Clustering from Data Streams

Clustering from Data Streams Clustering from Data Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting

More information

Problem 1: Complexity of Update Rules for Logistic Regression

Problem 1: Complexity of Update Rules for Logistic Regression Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1

More information

CS573 Data Privacy and Security. Li Xiong

CS573 Data Privacy and Security. Li Xiong CS573 Data Privacy and Security Anonymizationmethods Li Xiong Today Clustering based anonymization(cont) Permutation based anonymization Other privacy principles Microaggregation/Clustering Two steps:

More information

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Indexing Week 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Overview Conventional indexes B-trees Hashing schemes

More information

Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis

Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis Xavier Le Faucheur a, Brani Vidakovic b and Allen Tannenbaum a a School of Electrical and Computer Engineering, b Department of Biomedical

More information

Nearest Neighbor with KD Trees

Nearest Neighbor with KD Trees Case Study 2: Document Retrieval Finding Similar Documents Using Nearest Neighbors Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Emily Fox January 22 nd, 2013 1 Nearest

More information

Meta-Clustering. Parasaran Raman PhD Candidate School of Computing

Meta-Clustering. Parasaran Raman PhD Candidate School of Computing Meta-Clustering Parasaran Raman PhD Candidate School of Computing What is Clustering? Goal: Group similar items together Unsupervised No labeling effort Popular choice for large-scale exploratory data

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Image Analysis & Retrieval. CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W Lec 18.

Image Analysis & Retrieval. CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W Lec 18. Image Analysis & Retrieval CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W 4-5:15pm@Bloch 0012 Lec 18 Image Hashing Zhu Li Dept of CSEE, UMKC Office: FH560E, Email: lizhu@umkc.edu, Ph:

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Object Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision

Object Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision Object Recognition Using Pictorial Structures Daniel Huttenlocher Computer Science Department Joint work with Pedro Felzenszwalb, MIT AI Lab In This Talk Object recognition in computer vision Brief definition

More information

Outline. The History of Histograms. Yannis Ioannidis University of Athens, Hellas

Outline. The History of Histograms. Yannis Ioannidis University of Athens, Hellas The History of Histograms Yannis Ioannidis University of Athens, Hellas Outline Prehistory Definitions and Framework The Early Past 10 Years Ago The Recent Past Industry Competitors The Future Prehistory

More information

Max-Count Aggregation Estimation for Moving Points

Max-Count Aggregation Estimation for Moving Points Max-Count Aggregation Estimation for Moving Points Yi Chen Peter Revesz Dept. of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588, USA Abstract Many interesting problems

More information

pcube: Update-Efficient Online Aggregation with Progressive Feedback and Error Bounds

pcube: Update-Efficient Online Aggregation with Progressive Feedback and Error Bounds pcube: Update-Efficient Online Aggregation with Progressive Feedback and Error Bounds Mirek Riedewald, Divyakant Agrawal, and Amr El Abbadi Department of Computer Science University of California, Santa

More information

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Bumjoon Jo and Sungwon Jung (&) Department of Computer Science and Engineering, Sogang University, 35 Baekbeom-ro, Mapo-gu, Seoul 04107,

More information

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University NIPS 2008: E. Sudderth & M. Jordan, Shared Segmentation of Natural

More information

Hidden Markov Models. Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi

Hidden Markov Models. Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi Hidden Markov Models Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi Sequential Data Time-series: Stock market, weather, speech, video Ordered: Text, genes Sequential

More information

Methods for Intelligent Systems

Methods for Intelligent Systems Methods for Intelligent Systems Lecture Notes on Clustering (II) Davide Eynard eynard@elet.polimi.it Department of Electronics and Information Politecnico di Milano Davide Eynard - Lecture Notes on Clustering

More information

Chapter 5 Efficient Memory Information Retrieval

Chapter 5 Efficient Memory Information Retrieval Chapter 5 Efficient Memory Information Retrieval In this chapter, we will talk about two topics: (1) What is a kd-tree? (2) How can we use kdtrees to speed up the memory-based learning algorithms? Since

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Privacy preserving data mining Li Xiong Slides credits: Chris Clifton Agrawal and Srikant 4/3/2011 1 Privacy Preserving Data Mining Privacy concerns about personal data AOL

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Object Classification Problem

Object Classification Problem HIERARCHICAL OBJECT CATEGORIZATION" Gregory Griffin and Pietro Perona. Learning and Using Taxonomies For Fast Visual Categorization. CVPR 2008 Marcin Marszalek and Cordelia Schmid. Constructing Category

More information

Organizing Spatial Data

Organizing Spatial Data Organizing Spatial Data Spatial data records include a sense of location as an attribute. Typically location is represented by coordinate data (in 2D or 3D). 1 If we are to search spatial data using the

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information