Security Control Methods for Statistical Database

Similar documents
Statistical Databases: Query Restriction

Implementation of Aggregate Function in Multi Dimension Privacy Preservation Algorithms for OLAP

Partition Based Perturbation for Privacy Preserving Distributed Data Mining

Privacy Preserving Data Mining. Danushka Bollegala COMP 527

Anonymization Algorithms - Microaggregation and Clustering

SECURITY IN COMPUTING, FIFTH EDITION

CS573 Data Privacy and Security. Li Xiong

Privacy, Security & Ethical Issues

FMC: An Approach for Privacy Preserving OLAP

An Efficient Clustering Method for k-anonymization

Query Auditing for Protecting Sensitive Attributes in Statistical Databases

K ANONYMITY. Xiaoyong Zhou

Introduction to Data Mining

Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy. Xiaokui Xiao Nanyang Technological University

CSE 565 Computer Security Fall 2018

CS 161 Multilevel & Database Security. Military models of security

Approaches to distributed privacy protecting data mining

Record Linkage using Probabilistic Methods and Data Mining Techniques

CS573 Data Privacy and Security. Differential Privacy. Li Xiong

RippleMatch Privacy Policy

K-Anonymity and Other Cluster- Based Methods. Ge Ruan Oct. 11,2007

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Co-clustering for differentially private synthetic data generation

Synthetic Data. Michael Lin

Sensitive Data and Database Inference

Introduction To Security and Privacy Einführung in die IT-Sicherheit I

A Review on Privacy Preserving Data Mining Approaches

Incognito: Efficient Full Domain K Anonymity

COMP 465: Data Mining Still More on Clustering

CS419 Spring Computer Security. Vinod Ganapathy Lecture 15. Chapter 5: Database security

A Disclosure Avoidance Research Agenda

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

Data Anonymization - Generalization Algorithms

ECLT 5810 Clustering

Mixture Models and the EM Algorithm

Comparative Evaluation of Synthetic Dataset Generation Methods

A FUZZY BASED APPROACH FOR PRIVACY PRESERVING CLUSTERING

ECLT 5810 Clustering

IAT 355 Visual Analytics. Data and Statistical Models. Lyn Bartram

Using Machine Learning to Optimize Storage Systems

Emerging Measures in Preserving Privacy for Publishing The Data

DOT-K: Distributed Online Top-K Elements Algorithm with Extreme Value Statistics

0x1A Great Papers in Computer Security

MHPE 494: Data Analysis. Welcome! The Analytic Process

Accountability in Privacy-Preserving Data Mining

The Two Dimensions of Data Privacy Measures

Auditing a Batch of SQL Queries

3. Data Preprocessing. 3.1 Introduction

2. Data Preprocessing

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

Data Preprocessing. Slides by: Shree Jaswal

Parallel Composition Revisited

Machine Learning - Clustering. CS102 Fall 2017

Database Design with Entity Relationship Model

CS Introduction to Data Mining Instructor: Abdullah Mueen

HIPAA Federal Security Rule H I P A A

Privacy Challenges in Big Data and Industry 4.0

FLORIDA S PREHOSPITAL EMERGENCY MEDICAL SERVICES TRACKING & REPORTING SYSTEM

PRACTICAL K-ANONYMITY ON LARGE DATASETS. Benjamin Podgursky. Thesis. Submitted to the Faculty of the. Graduate School of Vanderbilt University

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

PRIVACY-PRESERVING MULTI-PARTY DECISION TREE INDUCTION

Data Has Shape. Did you know? Data has Shape! Examples. My Data What do you think the shape of height data for this class looks like?

Pufferfish: A Semantic Approach to Customizable Privacy

ECLT 5810 Data Preprocessing. Prof. Wai Lam

A Survey on: Privacy Preserving Mining Implementation Techniques

Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method

Fuzzy methods for database protection

Distributed Bottom up Approach for Data Anonymization using MapReduce framework on Cloud

AN EFFECTIVE FRAMEWORK FOR EXTENDING PRIVACY- PRESERVING ACCESS CONTROL MECHANISM FOR RELATIONAL DATA

Managing Data Resources

CSE 880:Database Systems. ER Model and Relation Schemas

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

Privacy in Statistical Databases

CS434 Notebook. April 19. Data Mining and Data Warehouse

Secure Multiparty Computation Introduction to Privacy Preserving Distributed Data Mining

Workload Characterization Techniques

Survey of Anonymity Techniques for Privacy Preserving

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

Mining Social Network Graphs

Social Data Exploration

IEEE 2013 JAVA PROJECTS Contact No: KNOWLEDGE AND DATA ENGINEERING

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Overview of Record Linkage Techniques

Data Security and Privacy. Topic 18: k-anonymity, l-diversity, and t-closeness

Secure Multiparty Computation Multi-round protocols

PSS718 - Data Mining

Survey Result on Privacy Preserving Techniques in Data Publishing

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

CSE4334/5334 Data Mining 4 Data and Data Preprocessing. Chengkai Li University of Texas at Arlington Fall 2017

Comparison and Analysis of Anonymization Techniques for Preserving Privacy in Big Data

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

DATA MINING AND WAREHOUSING

Auditing Interval-Based Inference

Data Analysis and Data Science

Overview of Information Security

Jarek Szlichta

Differentially Private Multi- Dimensional Time Series Release for Traffic Monitoring

Supplementary text S6 Comparison studies on simulated data

Evaluating the Classification Accuracy of Data Mining Algorithms for Anonymized Data

EFFICIENT RETRIEVAL OF DATA FROM CLOUD USING DATA PARTITIONING METHOD FOR BANKING APPLICATIONS [RBAC]

Transcription:

Security Control Methods for Statistical Database Li Xiong CS573 Data Privacy and Security

Statistical Database A statistical database is a database which provides statistics on subsets of records OLAP vs. OLTP Statistics may be performed to compute SUM, MEAN, MEDIAN, COUNT, MAX AND MIN of records

Types of Statistical Databases Static a static database is made once and never changes Example: U.S. Census Dynamic changes continuously to reflect real-time data Example: most online research databases

Types of Statistical Databases Centralized one database General purpose like census Decentralized multiple decentralized databases Special purpose like bank, hospital, academia, etc

Access Restriction Databases normally have different access levels for different types of users User ID and passwords are the most common methods for restricting access In a medical database: Doctors/Healthcare Representative full access to information Researchers only access to partial information (e.g. aggregate information) Statistical database: allow query access only to aggregate data, not individual records

Accuracy vs. Confidentiality Accuracy Researchers want to extract accurate and meaningful data Confidentiality Patients, laws and database administrators want to maintain the privacy of patients and the confidentiality of their information

Data Compromise Exact compromise a user is able to determine the exact value of a sensitive attribute of an individual Partial compromise a user is able to obtain an estimator for a sensitive attribute with a bounded variance Positive compromise determine an attribute has a particular value Negative compromise determine an attribute does not have a particular value Relative compromise determine the ranking of some confidential values

Security Methods Query restriction Data perturbation/anonymization Output perturbation

Comparison Query restriction cannot avoid inference, but they accurate responses to valid queries. Data perturbation techniques can prevent inference, but they cannot consistently provide useful query results. Output perturbation has low storage and computational overhead, however, is subject to the inference (averaging effect) and inaccurate results.

Statistical database vs. data anonymization Data anonymization is one technique that can be used to build statistical database Data anonymiztion can be used to release data for other purposes such as mining Other techniques such as query restriction and output purterbation can be used to build statistical database

Evaluation Criteria Security level of protection Statistical quality of information data utility Cost Suitability to numerical and/or categorical attributes Suitability to multiple confidential attributes Suitability to dynamic statistical DBs

Security Exact compromise a user is able to determine the exact value of a sensitive attribute of an individual Partial compromise a user is able to obtain an estimator for a sensitive attribute with a bounded variance Statistical disclosure control require a large number of queries to obtain a small variance of the estimator

Statistical Quality of Information Bias difference between the unperturbed statistic and the expected value of its perturbed estimate Precision variance of the estimators obtained by users Consistency lack of contradictions and paradoxes Contradictions: different responses to same query; average differs from sum/count Paradox: negative count

Cost Implementation cost Processing overhead Amount of education required to enable users to understand the method and make effective use of the SDB

Security Methods Query set restriction Query size control Query set overlap control Query auditing Data perturbation/anonymization Output perturbation

Query Set Size Control A query-set size control limit the number of records that must be in the result set Allows the query results to be displayed only if the size of the query set C satisfies the condition K <= C <= L K where L is the size of the database and K is a parameter that satisfies 0 <= K <= L/2

Query Set Size Control

Tracker Q1: Count ( Sex = Female ) = A Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B If B = A+1 Q3: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) & Diagnosis = Schizophrenia) Positively or negatively compromised!

Query set size control With query set size control the database can be easily compromised within a frame of 4-5 queries For query set control, if the threshold value k is large, then it will restrict too many queries And still does not guarantee protection from compromise

Query Set Overlap Control Basic idea: successive queries must be checked against the number of common records. If the number of common records in any query exceeds a given threshold, the requested statistic is not released. A query q(c) is only allowed if: X (C) X (D) r, r > 0 Where α is set by the administrator Number of queries needed for a compromise has a lower bound 1 + (K-1)/r

Query-set-overlap control Ineffective for cooperation of several users Statistics for a set and its subset cannot be released limiting usefulness Need to keep user profile High processing overhead every new query compared with all previous ones

Auditing Keeping up-to-date logs of all queries made by each user and check for possible compromise when a new query is issued Excessive computation and storage requirements Efficient methods for special types of queries

Audit Expert (Chin 1982) Query auditing method for SUM queries A SUM query can be considered as a linear equation where is whether record i belongs to the query set, xi is the sensitive value, and q is the query result A set of SUM queries can be thought of as a system of linear equations Maintains the binary matrix representing linearly independent queries and update it when a new query is issued A row with all 0s except for ith column indicates disclosure

Audit Expert Only stores linearly independent queries Not all queries are linearly independent Q1: Sum(Sex=M) Q2: Sum(Sex=M AND Age>20) Q3: Sum(Sex=M AND Age<=20)

Audit Expert O(L2) time complexity Further work reduced to O(L) time and space when number of queries < L Only for SUM queries No restrictions on query set size Maximizing non-confidential information is NP-complete

Auditing recent developments Online auditing Detect and deny queries that violate privacy requirement Denial themselves may implicitly disclose sensitive information Offline auditing Check if a privacy requirement has been violated after the queries have been executed Not to prevent

Security Methods Query set restriction Data perturbation/anonymization Partitioning Cell suppression Microaggregation Data perturbation Output perturbation

Partitioning Cluster individual entities into mutually exclusive subsets, called atomic populations The statistics of these atomic populations constitute the materials

Microaggregation Original Data Averaged Microaggregated Data

Data Perturbation

Security Methods Query set restriction Data perturbation/anonymization Output perturbation Sampling Varying output perturbation Rounding

Output Perturbation Instead of the raw data being transformed as in Data Perturbation, only the output or query results are perturbed The bias problem is less severe than with data perturbation

Output Perturbation Query Results Original Database Noise Added to Results Query Results

Random Sampling Only a sample of the query set (records meeting the requirements of the query) are used to compute and estimate the statistics Must maintain consistency by giving exact same results to the same query Weakness - Logical equivalent queries can result in a different query set consistency issue

Varying output perturbation Apply perturbation on the query set Less bias than data perturbation

Some Comparisons Method Security Richness of Information Costs Query-set set-size size control Low Low 1 Low Query-set set-overlap control Low Low High Auditing Moderate-Low Moderate High Partitioning Moderate-low Moderate-low Low Microaggregation Moderate Moderate Moderate Data Perturbation High High-Moderate Low Varying Output Perturbation Moderate Moderate-low Low Sampling Moderate Moderate-Low Moderate 1 Quality is low because a lot of information can be eliminated if the query does not meet the requirements

Sources Partial slides: http://www.cs.jmu.edu/users/aboutams Adam, Nabil R. ; Wortmann, John C.; Security- Control Methods for Statistical Databases: A Comparative Study; ACM Computing Surveys, Vol. 21, No. 4, December 1989 Fung et al. Privacy Preserving Data Publishing: A Survey of Recent Development, ACM Computing Surveys, in press, 2009