Anonymization Algorithms - Microaggregation and Clustering

Size: px
Start display at page:

Download "Anonymization Algorithms - Microaggregation and Clustering"

Transcription

1 Anonymization Algorithms - Microaggregation and Clustering Li Xiong CS573 Data Privacy and Anonymity

2 Anonymization using Microaggregation or Clustering Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anonymity via clustering, Aggarwal, PODS 2006 Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

3 Anonymization Methods Perturbative: distort the data statistics computed on the perturbed dataset should not differ significantly from the original microaggregation, additive noise Non-perturbative: don't distort the data generalization: combine several categories to form new less specific category suppression: remove values of a few attributes in some records, or entire records

4 Types of data Continuous: attribute is numeric and arithmetic operations can be performed on it Categorical: attribute takes values over a finite set and standard arithmetic operations don't make sense Ordinal: ordered range of categories, min and max operations are meaningful Nominal: unordered only equality comparison operation is meaningful

5 Measure tradeoffs k-anonymity: a dataset satisfies k-anonymity for k > 1 if at least k records exist for each combination of quasiidentifier values assuming k-anonymity is enough protection against disclosure risk, one can concentrate on information loss measures

6 Critique of Generalization/Suppression Satisfying k-anonymity using generalization and suppression is NP-hard Computational cost of finding the optimal generalization How to determine the subset of appropriate generalizations semantics of categories and intended use of data e.g., ZIP code: {08201, 08205} -> 0820* makes sense {08201, 05201} -> 0*201 doesn't

7 Problems cont. How to apply a generalization globally may generalize records that don't need it locally difficult to automate and analyze number of generalizations is even larger Generalization and suppression on continuous data are unsuitable a numeric attribute becomes categorical and loses its numeric semantics

8 Problems cont. How to optimally combine generalization and suppression is unknown Use of suppression is not homogenous suppress entire records or only some attributes of some records blank a suppressed value or replace it with a neutral value

9 Microaggregation/Clustering Two steps: Partition original dataset into clusters of similar records containing at least k records For each cluster, compute an aggregation operation and use it to replace the original records e.g., mean for continuous data, median for categorical data

10 Advantages: a unified approach, unlike combination of generalization and suppression Near-optimal heuristics exist Doesn't generate new categories Suitable for continuous data without removing their numeric semantics

11 Advantages cont. Reduces data distortion K-anonymity requires an attribute to be generalized or suppressed, even if all but one tuple in the set have the same value. Clustering allows a cluster center to be published instead, enabling us to release more information.

12 Original Table Age Salary Amy Brian Carol David Evelyn

13 2-Anonymity with Generalization Age Salary Amy Brian Carol David Evelyn Generalization allows pre-specified ranges

14 2-Anonymity with Clustering Age Salary Amy [25-29] [50-100] Brian [25-29] [50-100] Carol [25-29] [50-100] David [35-39] [ ] Evelyn [35-39] [ ] 27=( )/3 70=( )/3 37=(35+39)/2 115=( )/2 Cluster centers ([27,70], [37,115]) published

15 Another example: no common value among each attribute

16 Generalization vs. clustering Generalized version of the table would need to suppress all attributes. Clustered Version of the table would publish the cluster center as (1, 1, 1, 1), and the radius as 1.

17 Anonymization using Microaggregation or Clustering Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anonymity via clustering, Aggarwal, PODS 2006 Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

18 Multivariate microaggregation algorithm MDAV-generic: Generic version of MDAV algorithm (Maximum Distance to Average Vector) from previous papers Works with any type of data (continuous, ordinal, nominal), aggregation operator and distance calculation

19 MDAV-generic(R: dataset, k: integer) while R 3k 1. compute average record ~x of all records in R 2. find most distant record x r from ~x 3. find most distant record x s from x r 4. form two clusters from k-1 records closest to x r and k-1 closest to x s 5. Remove the clusters from R and run MDAV-generic on the remaining dataset end while if 3k-1 R 2k 1. compute average record ~x of remaining records in R 2. find the most distant record x r from ~x 3. form a cluster from k-1 records closest to ~x 4. form another cluster containing the remaining records else (fewer than 2k records in R) form a new cluster from the remaining records

20 MDAV-generic for continuous attributes use arithmetic mean and Euclidean distance standardize attributes (subtract mean and divide by standard deviation) to give them equal weight for computing distances After MDAV-generic, destandardize attributes x ij is value of k-anonymized jth attribute for the ith record m 0 1 (j) and m 2 (j) are mean and variance of the k-anonymized jth attribute u 0 1 (j) and u 2 (j) are mean and variance of the original jth attribute

21 MDAV-generic for ordinal attributes The distance between two categories a and b in an attribute V i : d ord (a,b) = ( {i i < b} ) / D(V i ) i.e., the number of categories separating a and b divided by the number of categories in the attribute Nominal attributes The distance between two values is defined according to equality: 0 if they're equal, else 1

22 Empirical Results Continuous attributes From the U.S. Current Population Survey (1995) 1080 records described by 13 continuous attributes Computed k-anonymity for k = 3,..., 9 and quasiidentifiers with 6 and 13 attributes Categorical attributes From the U.S. Housing Survey (1993) Three ordinal and eight nominal attributes Computed k-anonymity for k = 2,..., 9 and quasiidentifiers with 3, 4, 8 and 11 attributes

23 IL measures for continuous attributes IL1 = mean variation of individual attributes in original and k-anonymous datasets IL2 = mean variation of attribute means in both datasets IL3 = mean variation of attribute variances IL4 = mean variation of attribute covariances IL5 = mean variation of attribute Pearson's correlations IL6 = 100 times the average of IL1-6

24 MDAV-generic preserves means and variances The impact on the non-preserved statistics grows with the quasiidentifier length, as one would expect For a fixed-quasi-identifier length, the impact on the non-preserved statistics grows with k

25 IL measures for categorical attributes Dist: direct comparison of original and protected values using a categorical distance CTBIL': mean variation of frequencies in contingency tables for original and protected data (based on another paper by Domingo-Ferrer and Torra) ACTBIL': CTBIL' divided by the total number of cells in all considered tables EBIL: Entropy-based information loss (based on another paper by Domingo-Ferrer and Torra)

26 Ordinal attribute protection using median

27 Ordinal attribute protection using convex median

28 Anonymization using Microaggregation or Clustering Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anonymity via clustering, Aggarwal, PODS 2006 Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

29 r-clustering Attributes from a table are first redefined as points in metric space. These points are clustered, and then the cluster centers are published, rather than the original quasi-identifiers. r is the lower bound on the number of members in each cluster. r is used instead of k to denote the minimum degree of anonymity because k is typically used in clustering to denote the number of clusters.

30 Data published for clusters Three features are published for the clustered data the quasi-identifying attributes of the cluster center the number of points within the cluster the set of sensitive values for the cluster (which remain unchanged, as with k- anonymity) A measure of the quality of the clusters will also be published.

31 Defining the records in metric space Some attributes, such as age and height, are easily mapped to metric space. Others, such as zip may first need to be converted, for example to longitude and latitude. Some attributes may need to be scaled, such as location, which may differ by thousands of miles. Some attributes such as race or nationality may not convert to points in metric space easily.

32 How to measure the quality of the cluster Measures how much it distorts the original data. Maximum radius (r-gather problem) Maximum radius of all clusters Cellular cost (r-cellular CLUSTERING problem) Each cluster incurs a facility cost to set up the cluster center. Each cluster incurs a service cost which is equal to the radius times the number of points in the cluster Sum of the facility and services costs for each of the clusters.

33 Points arranged in clusters 17 points, radius 7 25 points, radius points, radius 8

34 Cluster quality measurements Maximum radius = 10 Facility cost plus service cost: Facility cost = f(c) Service cost = (17 x 7) + (14 x 8) + (25 x 10) = 481

35 r-gather problem The r-gather problem is to cluster n points in a metric space into a set of clusters, such that each cluster has at least r points. The objective is to minimize the maximum radius among the clusters.

36 Outlier points r-gather and r-cellular CLUSTERING, like k-anonymity, are sensitive to outlier points (i.e., points which are far removed from the rest of the data). The clustering solutions in this paper are generalized to allow an e fraction of outliers to be removed from the data, that is, e fraction of the tuples can be suppressed.

37 (r,e)-gather Clustering The (r, e)-gather clustering formulation of the problem allows an e fraction of the outlier points to be unclustered (i.e., these tuples are suppressed). The paper finds that there is a polynomial time algorithm that provides a 4- approximation for the (r,e)-gather problem.

38 r-cellular CLUSTERING defined The CELLULAR CLUSTERING problem is to arrange n points into clusters with each cluster has at least r points and with the minimum total cellular cost.

39 (r,e)-cellular CLUSTERING There is also a (r,e)-cellular CLUSTERING problem in which an e fraction of the points can be excluded. The details of the constant-factor approximation of this problem are deferred to the full version of this paper.

40 Anonymization using Microaggregation or Clustering Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anonymity via clustering, Aggarwal, PODS 2006 Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

41 Anonymization And Clustering k-member Clustering Problem From a given set of n records, find a set of clusters such that Each cluster contains at least k records, and The total intra-cluster distance is minimized. The problem is NP-complete 41

42 Distance Metrics Distance metric for records Measure the dissimilarities between two data points Sum of all dissimilarities between corresponding attributes. Numerical values Categorical values 42

43 Distance between two numerical values Definition Let D be a finite numeric domain. Then the normalized distance between two values vi, vj D is defined as: where D is the domain size measured by the difference between the maximum and minimum values in D. Age Country Occupation Salary Diagnosis r1 41 USA Armed-Forces 50K Cancer r2 57 India Tech-support <50K Flu r3 40 Canada Teacher <50K Obesity r4 38 Iran Tech-support 50K Flu r5 24 Brazil Doctor 50K Cancer r6 45 Greece Salesman <50K Fever Example 1 Distance between r1 and r2 with respect to Age attribute is / = 16/33 = Example 2 Distance between r5 and r6 with respect to Age attribute is / = 21/33 =

44 Distance between two categorical values Equally different to each other. 0 if they are the same 1 if they are different Relationships can be easily captured in a taxonomy tree. Taxonomy tree of Country Taxonomy tree of Occupation 44

45 Distance between two categorical values Definition Let D be a categorical domain and TD be a taxonomy tree defined for D. The normalized distance between two values vi, vj D is defined as: Taxonomy tree of Country where (x, y) is the subtree rooted at the lowest common ancestor of x and y, and H(T) represents the height of tree T. Example: The distance between India and USA is 3/3 = 1. The distance between India and Iran is 2/3 =

46 Distance between two records Definition Let QT = {N1,...,Nm, C1,...,Cn} be the quasi-identifier of table T, where Ni(i = 1,...,m) is an attribute with a numeric domain and Ci(i = 1,..., n) is an attribute with a categorical domain. The distance of two records r1, r2 T is defined as: where δn is the distance function for numeric attribute, and δc is the distance function for categorical attribute. 46

47 Distance between two records Continued Age Country Occupation Salary Diagno sis r1 41 USA Armed-Forces 50K Cancer r2 57 India Tech-support <50K Flu r3 40 Canada Teacher <50K Obesity r4 38 Iran Tech-support 50K Flu r5 24 Brazil Doctor 50K Cancer r6 45 Greece Salesman <50K Fever Example Taxonomy tree of Country the distance between the r1 and r2 is (16/33) + (3/3) + 1 = the distance between the r1 and r3 is (1/33) + (1/3) + 1 = Taxonomy tree of Occupation 47

48 Cost Function - Information loss (IL) The amount of distortion (i.e., information loss) caused by the generalization process. Note: Records in each cluster are generalized to share the same quasi-identifier value that represents every original quasiidentifier value in the cluster. Definition: Let e = {r1,..., rk} be a cluster (i.e., equivalence class). Then the amount of information loss in e, denoted by IL(e), is defined as: where e is the number of records in e, N represents the size of numeric domain N, ( Cj) is the subtree rooted at the lowest common ancestor of every value in Cj, and H(T) is the height of tree T. 48

49 Cost Function - Information loss (IL) Example Taxonomy tree of Country Age Country Occupation Salary Diagnosis r1 41 USA Armed-Forces 50K Cancer r2 57 India Tech-support <50K Flu r3 40 Canada Teacher <50K Obesity r4 38 Iran Tech-support 50K Flu r5 24 Brazil Doctor 50K Cancer r6 45 Greece Salesman <50K Fever Cluster e1 Age Country Occupation Salary Diagnosis 41 USA Armed-Forces 50K Cancer 40 Canada Teacher <50K Obesity 24 Brazil Doctor 50K Cancer IL(e1) = 3 D(e1) D(e1) = (41-24)/33 + (2/3) + 1 = IL(e1) = = Cluster e2 Age Country Occupation Salary Diagnosis 41 USA Armed-Forces 50K Cancer 57 India Tech-support <50K Flu 24 Brazil Doctor 50K Cancer IL(e2) = 3 D(e2) D(e2) = (57-24)/33 + (3/3) + 1 = 3 IL(e2) = 3 3 = 9 49

50 Greedy k-member clustering algorithm 50

51 Diversity Metrics The Equal Diversity metric (ED) assumes all sensitive attribute values are equally sensitive where φ(e, s) = 1 if every record in e has the same s value; φ(e, s) = 0, otherwise. Modification to the greedy algorithm: Sensitive Diversity metric (SD) assumes there are two types of values in a sensitive attribute: truly-sensitive not-so-sensitive where ψ(e, s) = 1 if every record in e has the same s value that is trulysensitive; ψ(e, s) = 0, otherwise Modification to the greedy algorithm 51

52 classification metric (CM) preserve the correlation between quasiidentifier and class labels (non-sensitive values) Where N is the total number of records, and Penalty(row r) = 1 if r is suppressed or the class label of r is different from the class label of the majority in the equivalence group. Modification to the greedy algorithm: 52

53 Experimentl Results Experimentl Setup Data: Adult dataset from the UC Irvine Machine Learning Repository 10 attributes (2 numeric, 7 categorical, 1 class) Compare with 2 other algorithms Median partitioning (mondrian algorithm) k-nearest neighbor 53

54 Experimentl Results 54

55 Conclusion Transforming the k-anonymity problem to the k-member clustering problem Overall the Greedy Algorithm produced better results compared to other algorithms at the cost of efficiency 55

CS573 Data Privacy and Security. Li Xiong

CS573 Data Privacy and Security. Li Xiong CS573 Data Privacy and Security Anonymizationmethods Li Xiong Today Clustering based anonymization(cont) Permutation based anonymization Other privacy principles Microaggregation/Clustering Two steps:

More information

Efficient k-anonymization Using Clustering Techniques

Efficient k-anonymization Using Clustering Techniques Efficient k-anonymization Using Clustering Techniques Ji-Won Byun 1,AshishKamra 2, Elisa Bertino 1, and Ninghui Li 1 1 CERIAS and Computer Science, Purdue University {byunj, bertino, ninghui}@cs.purdue.edu

More information

An Efficient Clustering Method for k-anonymization

An Efficient Clustering Method for k-anonymization An Efficient Clustering Method for -Anonymization Jun-Lin Lin Department of Information Management Yuan Ze University Chung-Li, Taiwan jun@saturn.yzu.edu.tw Meng-Cheng Wei Department of Information Management

More information

Ordinal, Continuous and Heterogeneous k-anonymity Through Microaggregation

Ordinal, Continuous and Heterogeneous k-anonymity Through Microaggregation Data Mining and Knowledge Discovery, 11, 195 212, 2005 c 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands. DOI: 10.1007/s10618-005-0007-5 Ordinal, Continuous and Heterogeneous

More information

Achieving k-anonmity* Privacy Protection Using Generalization and Suppression

Achieving k-anonmity* Privacy Protection Using Generalization and Suppression UT DALLAS Erik Jonsson School of Engineering & Computer Science Achieving k-anonmity* Privacy Protection Using Generalization and Suppression Murat Kantarcioglu Based on Sweeney 2002 paper Releasing Private

More information

K-Anonymity and Other Cluster- Based Methods. Ge Ruan Oct. 11,2007

K-Anonymity and Other Cluster- Based Methods. Ge Ruan Oct. 11,2007 K-Anonymity and Other Cluster- Based Methods Ge Ruan Oct 11,2007 Data Publishing and Data Privacy Society is experiencing exponential growth in the number and variety of data collections containing person-specific

More information

Optimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching

Optimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching Optimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching Tiancheng Li Ninghui Li CERIAS and Department of Computer Science, Purdue University 250 N. University Street, West

More information

Security Control Methods for Statistical Database

Security Control Methods for Statistical Database Security Control Methods for Statistical Database Li Xiong CS573 Data Privacy and Security Statistical Database A statistical database is a database which provides statistics on subsets of records OLAP

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Data Anonymization - Generalization Algorithms

Data Anonymization - Generalization Algorithms Data Anonymization - Generalization Algorithms Li Xiong CS573 Data Privacy and Anonymity Generalization and Suppression Z2 = {410**} Z1 = {4107*. 4109*} Generalization Replace the value with a less specific

More information

Partition Based Perturbation for Privacy Preserving Distributed Data Mining

Partition Based Perturbation for Privacy Preserving Distributed Data Mining BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 2 Sofia 2017 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2017-0015 Partition Based Perturbation

More information

Jeff Howbert Introduction to Machine Learning Winter

Jeff Howbert Introduction to Machine Learning Winter Collaborative Filtering Nearest es Neighbor Approach Jeff Howbert Introduction to Machine Learning Winter 2012 1 Bad news Netflix Prize data no longer available to public. Just after contest t ended d

More information

Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method

Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method Dr.K.P.Kaliyamurthie HOD, Department of CSE, Bharath University, Tamilnadu, India ABSTRACT: Automated

More information

Hierarchical Clustering

Hierarchical Clustering What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering

More information

Emerging Measures in Preserving Privacy for Publishing The Data

Emerging Measures in Preserving Privacy for Publishing The Data Emerging Measures in Preserving Privacy for Publishing The Data K.SIVARAMAN 1 Assistant Professor, Dept. of Computer Science, BIST, Bharath University, Chennai -600073 1 ABSTRACT: The information in the

More information

ACHIEVING ANONYMITY VIA CLUSTERING

ACHIEVING ANONYMITY VIA CLUSTERING ACHIEVING ANONYMITY VIA CLUSTERING GAGAN AGGARWAL, TOMÁS FEDER, KRISHNARAM KENTHAPADI, SAMIR KHULLER, RINA PANIGRAHY, DILYS THOMAS, AND AN ZHU Abstract. Publishing data for analysis from a table containing

More information

Survey of Anonymity Techniques for Privacy Preserving

Survey of Anonymity Techniques for Privacy Preserving 2009 International Symposium on Computing, Communication, and Control (ISCCC 2009) Proc.of CSIT vol.1 (2011) (2011) IACSIT Press, Singapore Survey of Anonymity Techniques for Privacy Preserving Luo Yongcheng

More information

Achieving Anonymity via Clustering

Achieving Anonymity via Clustering Achieving Anonymity via Clustering GAGAN AGGARWAL Google Inc., Mountain View TOMÁS FEDER Computer Science Department Stanford University, Stanford KRISHNARAM KENTHAPADI Microsoft Research, Mountain View

More information

Improving Privacy And Data Utility For High- Dimensional Data By Using Anonymization Technique

Improving Privacy And Data Utility For High- Dimensional Data By Using Anonymization Technique Improving Privacy And Data Utility For High- Dimensional Data By Using Anonymization Technique P.Nithya 1, V.Karpagam 2 PG Scholar, Department of Software Engineering, Sri Ramakrishna Engineering College,

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Achieving Anonymity via Clustering

Achieving Anonymity via Clustering Achieving Anonymity via Clustering Gagan Aggarwal 1 Tomás Feder 2 Krishnaram Kenthapadi 2 Samir Khuller 3 Rina Panigrahy 2,4 Dilys Thomas 2 An Zhu 1 ABSTRACT Publishing data for analysis from a table containing

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Gene expression & Clustering (Chapter 10)

Gene expression & Clustering (Chapter 10) Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

SOCIAL MEDIA MINING. Data Mining Essentials

SOCIAL MEDIA MINING. Data Mining Essentials SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

A FUZZY BASED APPROACH FOR PRIVACY PRESERVING CLUSTERING

A FUZZY BASED APPROACH FOR PRIVACY PRESERVING CLUSTERING A FUZZY BASED APPROACH FOR PRIVACY PRESERVING CLUSTERING 1 B.KARTHIKEYAN, 2 G.MANIKANDAN, 3 V.VAITHIYANATHAN 1 Assistant Professor, School of Computing, SASTRA University, TamilNadu, India. 2 Assistant

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking

More information

Secured Medical Data Publication & Measure the Privacy Closeness Using Earth Mover Distance (EMD)

Secured Medical Data Publication & Measure the Privacy Closeness Using Earth Mover Distance (EMD) Vol.2, Issue.1, Jan-Feb 2012 pp-208-212 ISSN: 2249-6645 Secured Medical Data Publication & Measure the Privacy Closeness Using Earth Mover Distance (EMD) Krishna.V #, Santhana Lakshmi. S * # PG Student,

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Statistical Disclosure Control meets Recommender Systems: A practical approach

Statistical Disclosure Control meets Recommender Systems: A practical approach Research Group Statistical Disclosure Control meets Recommender Systems: A practical approach Fran Casino and Agusti Solanas {franciscojose.casino, agusti.solanas}@urv.cat Smart Health Research Group Universitat

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Evolutionary tree reconstruction (Chapter 10)

Evolutionary tree reconstruction (Chapter 10) Evolutionary tree reconstruction (Chapter 10) Early Evolutionary Studies Anatomical features were the dominant criteria used to derive evolutionary relationships between species since Darwin till early

More information

Evaluating the Classification Accuracy of Data Mining Algorithms for Anonymized Data

Evaluating the Classification Accuracy of Data Mining Algorithms for Anonymized Data International Journal of Computer Science and Telecommunications [Volume 3, Issue 8, August 2012] 63 ISSN 2047-3338 Evaluating the Classification Accuracy of Data Mining Algorithms for Anonymized Data

More information

PRACTICAL K-ANONYMITY ON LARGE DATASETS. Benjamin Podgursky. Thesis. Submitted to the Faculty of the. Graduate School of Vanderbilt University

PRACTICAL K-ANONYMITY ON LARGE DATASETS. Benjamin Podgursky. Thesis. Submitted to the Faculty of the. Graduate School of Vanderbilt University PRACTICAL K-ANONYMITY ON LARGE DATASETS By Benjamin Podgursky Thesis Submitted to the Faculty of the Graduate School of Vanderbilt University in partial fulfillment of the requirements for the degree of

More information

Preserving Privacy during Big Data Publishing using K-Anonymity Model A Survey

Preserving Privacy during Big Data Publishing using K-Anonymity Model A Survey ISSN No. 0976-5697 Volume 8, No. 5, May-June 2017 International Journal of Advanced Research in Computer Science SURVEY REPORT Available Online at www.ijarcs.info Preserving Privacy during Big Data Publishing

More information

Comparative Analysis of Anonymization Techniques

Comparative Analysis of Anonymization Techniques International Journal of Electronic and Electrical Engineering. ISSN 0974-2174 Volume 7, Number 8 (2014), pp. 773-778 International Research Publication House http://www.irphouse.com Comparative Analysis

More information

Disclosure Control by Computer Scientists: An Overview and an Application of Microaggregation to Mobility Data Anonymization

Disclosure Control by Computer Scientists: An Overview and an Application of Microaggregation to Mobility Data Anonymization Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session IPS060) p.1082 Disclosure Control by Computer Scientists: An Overview and an Application of Microaggregation to Mobility

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing

More information

Secure Multiparty Computation Multi-round protocols

Secure Multiparty Computation Multi-round protocols Secure Multiparty Computation Multi-round protocols Li Xiong CS573 Data Privacy and Security Secure multiparty computation General circuit based secure multiparty computation methods Specialized secure

More information

Fuzzy methods for database protection

Fuzzy methods for database protection EUSFLAT-LFA 2011 July 2011 Aix-les-Bains, France Fuzzy methods for database protection Vicenç Torra1 Daniel Abril1 Guillermo Navarro-Arribas2 1 IIIA, Artificial Intelligence Research Institute CSIC, Spanish

More information

K ANONYMITY. Xiaoyong Zhou

K ANONYMITY. Xiaoyong Zhou K ANONYMITY LATANYA SWEENEY Xiaoyong Zhou DATA releasing: Privacy vs. Utility Society is experiencing exponential growth in the number and variety of data collections containing person specific specific

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 1 Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks in Data Preprocessing Data Cleaning Data Integration Data

More information

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008 CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008 Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute NAME: Prof. Ruiz Problem

More information

Clustering: Centroid-Based Partitioning

Clustering: Centroid-Based Partitioning Clustering: Centroid-Based Partitioning Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong 1 / 29 Y Tao Clustering: Centroid-Based Partitioning In this lecture, we

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

Cluster Analysis. CSE634 Data Mining

Cluster Analysis. CSE634 Data Mining Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction

More information

SIMPLE AND EFFECTIVE METHOD FOR SELECTING QUASI-IDENTIFIER

SIMPLE AND EFFECTIVE METHOD FOR SELECTING QUASI-IDENTIFIER 31 st July 216. Vol.89. No.2 25-216 JATIT & LLS. All rights reserved. SIMPLE AND EFFECTIVE METHOD FOR SELECTING QUASI-IDENTIFIER 1 AMANI MAHAGOUB OMER, 2 MOHD MURTADHA BIN MOHAMAD 1 Faculty of Computing,

More information

A COMPARATIVE STUDY ON MICROAGGREGATION TECHNIQUES FOR MICRODATA PROTECTION

A COMPARATIVE STUDY ON MICROAGGREGATION TECHNIQUES FOR MICRODATA PROTECTION A COMPARATIVE STUDY ON MICROAGGREGATION TECHNIQUES FOR MICRODATA PROTECTION Sarat Kumar Chettri 1, Bonani Paul 2 and Ajoy Krishna Dutta 2 1 Department of Computer Science, Saint Mary s College, Shillong,

More information

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data DATA ANALYSIS I Types of Attributes Sparse, Incomplete, Inaccurate Data Sources Bramer, M. (2013). Principles of data mining. Springer. [12-21] Witten, I. H., Frank, E. (2011). Data Mining: Practical machine

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A

More information

Utility-Preserving Differentially Private Data Releases Via Individual Ranking Microaggregation

Utility-Preserving Differentially Private Data Releases Via Individual Ranking Microaggregation Utility-Preserving Differentially Private Data Releases Via Individual Ranking Microaggregation David Sánchez, Josep Domingo-Ferrer, Sergio Martínez, and Jordi Soria-Comas UNESCO Chair in Data Privacy,

More information

(α, k)-anonymity: An Enhanced k-anonymity Model for Privacy-Preserving Data Publishing

(α, k)-anonymity: An Enhanced k-anonymity Model for Privacy-Preserving Data Publishing (α, k)-anonymity: An Enhanced k-anonymity Model for Privacy-Preserving Data Publishing Raymond Chi-Wing Wong, Jiuyong Li +, Ada Wai-Chee Fu and Ke Wang Department of Computer Science and Engineering +

More information

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order. Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Privacy preserving data mining Li Xiong Slides credits: Chris Clifton Agrawal and Srikant 4/3/2011 1 Privacy Preserving Data Mining Privacy concerns about personal data AOL

More information

Data Preprocessing. Komate AMPHAWAN

Data Preprocessing. Komate AMPHAWAN Data Preprocessing Komate AMPHAWAN 1 Data cleaning (data cleansing) Attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. 2 Missing value

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and

More information

Achieving Anonymity via Clustering

Achieving Anonymity via Clustering Achieving Anonymity via Clustering Gagan Aggrawal 1 Google Inc. Mountain View, CA 94043 gagan@cs.stanford.edu Samir Khuller 3 Dept. Comp. Sci. University of Maryland College Park, MD 20742 samir@cs.umd.edu

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Data Security and Privacy. Topic 18: k-anonymity, l-diversity, and t-closeness

Data Security and Privacy. Topic 18: k-anonymity, l-diversity, and t-closeness Data Security and Privacy Topic 18: k-anonymity, l-diversity, and t-closeness 1 Optional Readings for This Lecture t-closeness: Privacy Beyond k-anonymity and l-diversity. Ninghui Li, Tiancheng Li, and

More information

Co-clustering for differentially private synthetic data generation

Co-clustering for differentially private synthetic data generation Co-clustering for differentially private synthetic data generation Tarek Benkhelif, Françoise Fessant, Fabrice Clérot and Guillaume Raschia January 23, 2018 Orange Labs & LS2N Journée thématique EGC &

More information

L-Diversity Algorithm for Incremental Data Release

L-Diversity Algorithm for Incremental Data Release Appl. ath. Inf. Sci. 7, No. 5, 2055-2060 (203) 2055 Applied athematics & Information Sciences An International Journal http://dx.doi.org/0.2785/amis/070546 L-Diversity Algorithm for Incremental Data Release

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Spatial Outlier Detection

Spatial Outlier Detection Spatial Outlier Detection Chang-Tien Lu Department of Computer Science Northern Virginia Center Virginia Tech Joint work with Dechang Chen, Yufeng Kou, Jiang Zhao 1 Spatial Outlier A spatial data point

More information

Road Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary

Road Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary 2. Data preprocessing Road Map Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary 2 Data types Categorical vs. Numerical Scale types

More information

Privacy Preserving Data Publishing for Recommender System

Privacy Preserving Data Publishing for Recommender System IT 11 023 Examensarbete 30 hp Maj 2011 Privacy Preserving Data Publishing for Recommender System Xiaoqiang Chen Institutionen för informationsteknologi Department of Information Technology Abstract Privacy

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Incognito: Efficient Full Domain K Anonymity

Incognito: Efficient Full Domain K Anonymity Incognito: Efficient Full Domain K Anonymity Kristen LeFevre David J. DeWitt Raghu Ramakrishnan University of Wisconsin Madison 1210 West Dayton St. Madison, WI 53706 Talk Prepared By Parul Halwe(05305002)

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Data Statistics Population. Census Sample Correlation... Statistical & Practical Significance. Qualitative Data Discrete Data Continuous Data

Data Statistics Population. Census Sample Correlation... Statistical & Practical Significance. Qualitative Data Discrete Data Continuous Data Data Statistics Population Census Sample Correlation... Voluntary Response Sample Statistical & Practical Significance Quantitative Data Qualitative Data Discrete Data Continuous Data Fewer vs Less Ratio

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 2 Sajjad Haider Spring 2010 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric

More information

Community Detection. Jian Pei: CMPT 741/459 Clustering (1) 2

Community Detection. Jian Pei: CMPT 741/459 Clustering (1) 2 Clustering Community Detection http://image.slidesharecdn.com/communitydetectionitilecturejune0-0609559-phpapp0/95/community-detection-in-social-media--78.jpg?cb=3087368 Jian Pei: CMPT 74/459 Clustering

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Fall 2013 Reading: Chapter 3 Han, Chapter 2 Tan Anca Doloc-Mihu, Ph.D. Some slides courtesy of Li Xiong, Ph.D. and 2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.

More information

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering

More information

Parallel Composition Revisited

Parallel Composition Revisited Parallel Composition Revisited Chris Clifton 23 October 2017 This is joint work with Keith Merrill and Shawn Merrill This work supported by the U.S. Census Bureau under Cooperative Agreement CB16ADR0160002

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

Protecting Against Maximum-Knowledge Adversaries in Microdata Release: Analysis of Masking and Synthetic Data Using the Permutation Model

Protecting Against Maximum-Knowledge Adversaries in Microdata Release: Analysis of Masking and Synthetic Data Using the Permutation Model Protecting Against Maximum-Knowledge Adversaries in Microdata Release: Analysis of Masking and Synthetic Data Using the Permutation Model Josep Domingo-Ferrer and Krishnamurty Muralidhar Universitat Rovira

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms. Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering

More information

STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring Lecture 5 Tuesday, Feb 1 STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Clustering. Shishir K. Shah

Clustering. Shishir K. Shah Clustering Shishir K. Shah Acknowledgement: Notes by Profs. M. Pollefeys, R. Jin, B. Liu, Y. Ukrainitz, B. Sarel, D. Forsyth, M. Shah, K. Grauman, and S. K. Shah Clustering l Clustering is a technique

More information

Nearest Neighbor Predictors

Nearest Neighbor Predictors Nearest Neighbor Predictors September 2, 2018 Perhaps the simplest machine learning prediction method, from a conceptual point of view, and perhaps also the most unusual, is the nearest-neighbor method,

More information

Features: representation, normalization, selection. Chapter e-9

Features: representation, normalization, selection. Chapter e-9 Features: representation, normalization, selection Chapter e-9 1 Features Distinguish between instances (e.g. an image that you need to classify), and the features you create for an instance. Features

More information

On Privacy-Preservation of Text and Sparse Binary Data with Sketches

On Privacy-Preservation of Text and Sparse Binary Data with Sketches On Privacy-Preservation of Text and Sparse Binary Data with Sketches Charu C. Aggarwal Philip S. Yu Abstract In recent years, privacy preserving data mining has become very important because of the proliferation

More information

UNIT 2 Data Preprocessing

UNIT 2 Data Preprocessing UNIT 2 Data Preprocessing Lecture Topic ********************************************** Lecture 13 Why preprocess the data? Lecture 14 Lecture 15 Lecture 16 Lecture 17 Data cleaning Data integration and

More information

Achieving Anonymity via Clustering

Achieving Anonymity via Clustering Achieving Anonymity via Clustering GAGAN AGGARWAL Google Inc. TOMÁS FEDER Stanford University KRISHNARAM KENTHAPADI Microsoft Research SAMIR KHULLER University of Maryland RINA PANIGRAHY Microsoft Research

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Clustering, cont. Genome 373 Genomic Informatics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Clustering, cont. Genome 373 Genomic Informatics Elhanan Borenstein. Some slides adapted from Jacques van Helden Clustering, cont Genome 373 Genomic Informatics Elhanan Borenstein Some slides adapted from Jacques van Helden Improving the search heuristic: Multiple starting points Simulated annealing Genetic algorithms

More information

Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242

Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242 Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242 Creation & Description of a Data Set * 4 Levels of Measurement * Nominal, ordinal, interval, ratio * Variable Types

More information

NON-CENTRALIZED DISTINCT L-DIVERSITY

NON-CENTRALIZED DISTINCT L-DIVERSITY NON-CENTRALIZED DISTINCT L-DIVERSITY Chi Hong Cheong 1, Dan Wu 2, and Man Hon Wong 3 1,3 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong {chcheong, mhwong}@cse.cuhk.edu.hk

More information

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data. Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss

More information