PrivApprox. Privacy- Preserving Stream Analytics.

Similar documents
CS573 Data Privacy and Security. Differential Privacy. Li Xiong

with BLENDER: Enabling Local Search a Hybrid Differential Privacy Model

ApproxJoin: Approximate Distributed Joins

Time Distortion Anonymization for the Publication of Mobility Data with High Utility

Private & Anonymous Communication. Peter Kairouz ECE Department University of Illinois at Urbana-Champaign

Reza Tourani, Satyajayant (Jay) Misra, Travis Mick

Towards Practical Differential Privacy for SQL Queries. Noah Johnson, Joseph P. Near, Dawn Song UC Berkeley

Anonymization for web, fixed line, and mobile applications

Utilizing Large-Scale Randomized Response at Google: RAPPOR and its lessons

Data Privacy in Big Data Applications. Sreagni Banerjee CS-846

Sub-millisecond Stateful Stream Querying over Fast-evolving Linked Data

Differentially Private H-Tree

Safely Measuring Tor. Rob Jansen U.S. Naval Research Laboratory Center for High Assurance Computer Systems

Delegated Access for Hadoop Clusters in the Cloud

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Efficient Lists Intersection by CPU- GPU Cooperative Computing

Differential Privacy. Seminar: Robust Data Mining Techniques. Thomas Edlich. July 16, 2017

Estimating Quantiles from the Union of Historical and Streaming Data

Policy-Sealed Data: A New Abstraction for Building Trusted Cloud Services

Distributed Private Data Collection at Scale

Web Serving Architectures

The Marriage of Incremental and Approximate Computing

Developing the ERS Collaboration Framework

Differentially Private Multi- Dimensional Time Series Release for Traffic Monitoring

Lambda Architecture for Batch and Stream Processing. October 2018

VPriv: Protecting Privacy in Location- Based Vehicular Services

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Secure Remote Storage Using Oblivious RAM

S-Store: Streaming Meets Transaction Processing

Mahout: Low-Overhead Datacenter Traffic Management using End-Host-Based Elephant Detection. Vasileios Dimitrakis

Securing the Frisbee Multicast Disk Loader

Building Durable Real-time Data Pipeline

Privacy Protected Spatial Query Processing

Safely Measuring Tor. Rob Jansen U.S. Naval Research Laboratory Center for High Assurance Computer Systems

microsoft

Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel The University of Texas at Austin

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

Scaling Distributed Machine Learning with the Parameter Server

The Confounding Problem of Private Data Release

Guarding user Privacy with Federated Learning and Differential Privacy

Identify and cluster touchpoints in several ways Identify risks and initiatives associated to touchpoints

R-Storm: A Resource-Aware Scheduler for STORM. Mohammad Hosseini Boyang Peng Zhihao Hong Reza Farivar Roy Campbell

Void main Technologies

The Loopix Anonymity System

Dandelion: Privacy-Preserving Transaction Propagation in Bitcoin s P2P Network

Homomorphic Encryption. By Raj Thimmiah

ZHT: Const Eventual Consistency Support For ZHT. Group Member: Shukun Xie Ran Xin

Popular SIEM vs aisiem

Herbivore: An Anonymous Information Sharing System

Sanitization of call detail records via differentially-private Bloom filters

Privacy Challenges in Big Data and Industry 4.0

Differential Privacy. Cynthia Dwork. Mamadou H. Diallo

Unifying Big Data Workloads in Apache Spark

Metrics for Security and Performance in Low-Latency Anonymity Systems

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

Remote Data Checking for Network Codingbased. Distributed Storage Systems

Network Security: Anonymity. Tuomas Aura T Network security Aalto University, autumn 2015

A Relational Platform for Efficient Large-Scale Video Analytics. Yao Lu, Aakanksha Chowdhery, Srikanth Kandula

Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation

Understanding the latent value in all content

Data Storage Infrastructure at Facebook

Privacy-Preserving Machine Learning

Database Learning: Toward a Database that Becomes Smarter Over Time

Protocols for Anonymous Communication

Design and Implementation of Privacy-Preserving Surveillance. Aaron Segal

Computations with Bounded Errors and Response Times on Very Large Data

Heckaton. SQL Server's Memory Optimized OLTP Engine

DJoin: Differentially Private Join Queries over Distributed Databases. University of Pennsylvania

Anonymous communications: Crowds and Tor

PageVault: Securing Off-Chip Memory Using Page-Based Authen?ca?on. Blaise-Pascal Tine Sudhakar Yalamanchili

JAVA IEEE TRANSACTION ON CLOUD COMPUTING. 1. ITJCC01 Nebula: Distributed Edge Cloud for Data Intensive Computing

Towards Energy Proportionality for Large-Scale Latency-Critical Workloads

PrivCount: A Distributed System for Safely Measuring Tor

Blockchain for Enterprise: A Security & Privacy Perspective through Hyperledger/fabric

Security Control Methods for Statistical Database

Service Mesh and Microservices Networking

Privacy-Preserving Sensor Cloud. Hung Dang, Yun Long Chong, Francois Brun, Ee-Chien Chang School of Computing National University of Singapore

Sparrow. Distributed Low-Latency Spark Scheduling. Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica

ReDroid: Prioritizing Data Flows and Sinks for App Security Transformation

CHAPTER 6 SOLUTION TO NETWORK TRAFFIC PROBLEM IN MIGRATING PARALLEL CRAWLERS USING FUZZY LOGIC

HYRISE In-Memory Storage Engine

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

How Alice and Bob meet if they don t like onions

Focus: Querying Large Video Datasets with Low Latency and Low Cost

Privacy-Preserving Computation with Trusted Computing via Scramble-then-Compute

from circuits to RAM programs in malicious-2pc

System Models. 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models. Nicola Dragoni Embedded Systems Engineering DTU Informatics

A User-level Secure Grid File System

A Solution for Geographic Regions Load Balancing in Cloud Computing Environment

Differentially-Private Network Trace Analysis. Frank McSherry and Ratul Mahajan Microsoft Research

DROPBOX.COM - PRIVACY POLICY

IP Mobility vs. Session Mobility

Giza: Erasure Coding Objects across Global Data Centers

Maelstrom: An Enterprise Continuity Protocol for Financial Datacenters

Approaches to distributed privacy protecting data mining

Privacy-Preserving Using Data mining Technique in Cloud Computing

CS526: Information security

Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy. Xiaokui Xiao Nanyang Technological University

Transcription:

PrivApprox Privacy- Preserving Stream Analytics https://privapprox.github.io Do Le Quoc, Martin Beck, Pramod Bhatotia, Ruichuan Chen, Christof Fetzer, Thorsten Strufe July 2017

Motivation Clients Analysts Private data Recommendation, Ads 1

Motivation Clients Analysts Private data Recommendation, Ads Strong privacy guarantee 1

Motivation Clients Analysts Private data Recommendation, Ads Strong privacy guarantee High utility analytics in real- time 1

Motivation Clients Analysts Private data Recommendation, Ads Strong privacy guarantee High utility analytics in real- time How to preserve users privacy while supporting high- utility data analytics for low- latency stream processing? 1

State- of- the- art systems Clients 2

State- of- the- art systems Clients Personal data should be stored locally under the clients control 2

State- of- the- art systems Clients Personal data should be stored locally under the clients control 2

State- of- the- art systems Clients 2

State- of- the- art systems Clients Aggregator Analyst 2

State- of- the- art systems Clients Forward query Aggregator Submit query Analyst 2

State- of- the- art systems Clients Aggregator Analyst 2

State- of- the- art systems Clients Answer query Aggregator Analyst 2

State- of- the- art systems Clients Answer query Aggregator Add noise Analyst 2

State- of- the- art systems Clients Answer query Aggregator Add noise Privacy- preserving output Analyst 2

State- of- the- art systems Clients Answer query Aggregator Add noise Privacy- preserving output Analyst Differential Privacy 2

State- of- the- art systems Clients Answer query Aggregator Add noise Privacy- preserving output Analyst Differential Privacy Limitations: 2

State- of- the- art systems Clients Answer query Aggregator Add noise Privacy- preserving output Analyst Differential Privacy Limitations: Deal with only single- shot batch queries L 2

State- of- the- art systems Clients Answer query Aggregator Add noise Privacy- preserving output Analyst Differential Privacy Limitations: Deal with only single- shot batch queries L Require synchronization between system components L 2

State- of- the- art systems Clients Answer query Aggregator Add noise Privacy- preserving output Analyst Differential Privacy Limitations: Deal with only single- shot batch queries L Require synchronization between system components L Require a trusted aggregator L 2

PrivApprox Clients PrivApprox Analyst 3

PrivApprox Clients PrivApprox Analyst PrivApprox: 3

PrivApprox Clients PrivApprox Analyst PrivApprox: Supports stream processing with low latency J 3

PrivApprox Clients PrivApprox Analyst PrivApprox: Supports stream processing with low latency J Enables a truly synchronization- free distributed architecture J 3

PrivApprox Clients PrivApprox Analyst PrivApprox: Supports stream processing with low latency J Enables a truly synchronization- free distributed architecture J Requires lower trust in aggregator J 3

Outline Motivation Overview Design Evaluation 4

System overview Clients PrivApprox Analyst 5

System overview Clients PrivApprox (Query, budget) Analyst 5

System overview Clients PrivApprox (Query, budget) Analyst Execution budget: Latency/throughput guarantees Desired computing resources for query processing Desired accuracy 5

System overview Clients PrivApprox (Query, budget) Analyst Execution budget: Latency/throughput guarantees Desired computing resources for query processing Desired accuracy 5

System overview Clients PrivApprox Analyst 5

System overview Clients PrivApprox Result Analyst 5

System overview Clients PrivApprox Approximate computing Result Analyst 5

System overview Clients PrivApprox Approximate computing Result Analyst Low latency 5

System overview Clients PrivApprox Approximate computing + Randomized response Result Analyst Low latency 5

System overview Clients PrivApprox Approximate computing + Randomized response Result Analyst Low latency Privacy 5

System overview Clients PrivApprox Approximate computing + Randomized response Result Analyst 5

System overview Clients PrivApprox Approximate computing + Randomized response Result Analyst Zero- knowledge Privacy 5

System overview Clients PrivApprox Approximate computing + Randomized response Result Analyst Zero- knowledge Privacy Zero- knowledge Privacy > Differential Privacy 5

#1: Approximate computing 6

#1: Approximate computing State- of- the- art- systems Compute Add noise (Privacy- preserving) approximate output 6

#1: Approximate computing State- of- the- art- systems Compute Add noise (Privacy- preserving) approximate output Idea: To achieve low latency, compute over a sub- set of data items instead of the entire data- set 6

#1: Approximate computing State- of- the- art- systems Compute Add noise (Privacy- preserving) approximate output Idea: To achieve low latency, compute over a sub- set of data items instead of the entire data- set Take a sample Approximate computing Compute Approximate output ± Error bound 6

#2: Randomized response 7

#2: Randomized response Idea: To preserve privacy, clients may not need to provide truthful answers every time 7

#2: Randomized response Idea: To preserve privacy, clients may not need to provide truthful answers every time Client 7

#2: Randomized response Idea: To preserve privacy, clients may not need to provide truthful answers every time Client 7

#2: Randomized response Idea: To preserve privacy, clients may not need to provide truthful answers every time Client Truthful Answer 7

#2: Randomized response Idea: To preserve privacy, clients may not need to provide truthful answers every time Client Truthful Answer 7

#2: Randomized response Idea: To preserve privacy, clients may not need to provide truthful answers every time Client No Truthful Answer Yes 7

#2: Randomized response Idea: To preserve privacy, clients may not need to provide truthful answers every time Client No Truthful Answer Yes Provides plausible deniability for clients responding to sensitive queries; achieves differential privacy (RAPPOR [CCS 14]) 7

Outline Motivation Overview Design Evaluation 8

Query model 9

Query model Divide answer s value range into buckets, enforce a binary answer in each bucket 9

Query model Divide answer s value range into buckets, enforce a binary answer in each bucket Query: SELECT age FROM clients WHERE city = Santa Clara 9

Query model Divide answer s value range into buckets, enforce a binary answer in each bucket Query: SELECT age FROM clients WHERE city = Santa Clara 1-20 21-30 31-40 41-50 51-60 >60 9

Query model Divide answer s value range into buckets, enforce a binary answer in each bucket Query: SELECT age FROM clients WHERE city = Santa Clara Age: 31 0 0 1 0 0 0 1-20 21-30 31-40 41-50 51-60 >60 9

Query model Divide answer s value range into buckets, enforce a binary answer in each bucket Query: SELECT age FROM clients WHERE city = Santa Clara Age: 31 0 0 1 0 0 0 1-20 21-30 31-40 41-50 51-60 >60 Client cannot arbitrarily manipulate answers 9

Workflow: Submit query Aggregator (Query, budget) Analyst 10

Workflow: Submit query Clients Aggregator (Query, budget) Analyst Cost- Function(budget) System parameters: Sampling parameter Randomized response parameters 10

Workflow: Submit query Clients (Query, parameters) Aggregator (Query, budget) Analyst Cost- Function(budget) System parameters: Sampling parameter Randomized response parameters 10

Workflow: Answer query 11

Workflow: Answer query Client 11

Workflow: Answer query Client Step #1 Sampling (Flip a coin to decide to answer query or not) 11

Workflow: Answer query Client Step #1 Step #2 Sampling (Flip a coin to decide to answer query or not) Randomized Response 11

Workflow: Answer query Client Step #1 Step #2 Step #3 Sampling (Flip a coin to decide to answer query or not) Randomized Response Send randomized answer 11

Workflow: Answer query Client Step #1 Step #2 Step #3 Sampling (Flip a coin to decide to answer query or not) Randomized Response Send randomized answer Zero- knowledge privacy 11

Workflow: Answer query Client Step #1 Step #2 Step #3 Sampling (Flip a coin to decide to answer query or not) Randomized Response Send randomized answer Zero- knowledge privacy See the paper for details! 11

Workflow: Answer query Clients Randomized answers Aggregator 12

Workflow: Answer query Clients Randomized answers Aggregator Approximate result ± Error bound Analyst 12

Workflow: Answer query Clients Randomized answers Aggregator Approximate result ± Error bound Analyst Lack of anonymity and unlinkability? 12

#3: Anonymity and unlinkability 13

#3: Anonymity and unlinkability Idea: XOR- based Encryption 13

#3: Anonymity and unlinkability Idea: XOR- based Encryption Client 13

#3: Anonymity and unlinkability Idea: XOR- based Encryption Client Encrypt answer M: GenerateKey - > M k M XOR M k - > M E 13

#3: Anonymity and unlinkability Idea: XOR- based Encryption Client Proxy Aggregator Proxy Encrypt answer M: GenerateKey - > M k M XOR M k - > M E 13

#3: Anonymity and unlinkability Idea: XOR- based Encryption Client Proxy Aggregator Proxy Encrypt answer M: GenerateKey - > M k M XOR M k - > M E Decrypt answer M E : M E XOR M k - > M 13

Implementation Clients Proxy Aggregator Analyst Proxy 14

Implementation Clients Proxy Aggregator Analyst Proxy 14

Implementation Clients Proxy Aggregator Analyst Proxy 14

Implementation Clients Proxy Aggregator Analyst Proxy 14

Outline Motivation Overview Design Evaluation 15

Experimental setup Evaluation questions Utility vs privacy Throughput & latency Network overhead 16

Experimental setup Evaluation questions Utility vs privacy Throughput & latency Network overhead See the paper for more results! 16

Experimental setup Evaluation questions Utility vs privacy Throughput & latency Network overhead See the paper for more results! Testbed Cluster: 44 nodes Dataset: NYC Taxi ride records, household electricity usage 16

Accuracy vs privacy 17

Accuracy vs privacy 0.6 Randomization parameters #1 (p = 0.6, q = 0.6) Randomization parameters #2 (p = 0.9, q = 0.6) 6 Accuracy loss (%) 0.5 0.4 0.3 0.2 0.1 5 4 3 2 1 Privacy (ε zk ) 0 10 20 40 60 80 90 Sampling Fraction (%) 0 17

Accuracy vs privacy 0.6 Randomization parameters #1 (p = 0.6, q = 0.6) Randomization parameters #2 (p = 0.9, q = 0.6) 6 The lower the better Accuracy loss (%) 0.5 0.4 0.3 0.2 Accuracy loss Privacy level 5 4 3 2 Privacy (ε zk ) 0.1 1 0 10 20 40 60 80 90 Sampling Fraction (%) Trade- off between utility and privacy 0 17

Accuracy vs privacy 0.6 Randomization parameters #1 (p = 0.6, q = 0.6) Randomization parameters #2 (p = 0.9, q = 0.6) 6 The lower the better Accuracy loss (%) 0.5 0.4 0.3 0.2 Accuracy loss Privacy level 5 4 3 2 Privacy (ε zk ) 0.1 1 0 10 20 40 60 80 90 Sampling Fraction (%) Trade- off between utility and privacy 0 17

Accuracy vs privacy 0.6 Randomization parameters #1 (p = 0.6, q = 0.6) Randomization parameters #2 (p = 0.9, q = 0.6) 6 The lower the better Accuracy loss (%) 0.5 0.4 0.3 0.2 Accuracy loss Privacy level 5 4 3 2 Privacy (ε zk ) 0.1 1 0 10 20 40 60 80 90 Sampling Fraction (%) Trade- off between utility and privacy 0 17

Accuracy vs privacy 0.6 Randomization parameters #1 (p = 0.6, q = 0.6) Randomization parameters #2 (p = 0.9, q = 0.6) 6 The lower the better Accuracy loss (%) 0.5 0.4 0.3 0.2 Accuracy loss Privacy level 5 4 3 2 Privacy (ε zk ) 0.1 1 0 10 20 40 60 80 90 Sampling Fraction (%) Trade- off between utility and privacy 0 17

Accuracy vs privacy 0.6 Randomization parameters #1 (p = 0.6, q = 0.6) Randomization parameters #2 (p = 0.9, q = 0.6) 6 The lower the better Accuracy loss (%) 0.5 0.4 0.3 0.2 Accuracy loss Privacy level 5 4 3 2 Privacy (ε zk ) 0.1 1 0 10 20 40 60 80 90 Sampling Fraction (%) Trade- off between utility and privacy 0 17

Throughput 18

Throughput NYC Taxi Ride Household Electricity Throughput (K) 2500 2000 1500 1000 500 0 1 5 10 15 20 #nodes 18

Throughput NYC Taxi Ride Household Electricity The higher the better Throughput (K) 2500 2000 1500 1000 500 0 1 5 10 15 20 #nodes 18

Throughput NYC Taxi Ride Household Electricity The higher the better Throughput (K) 2500 2000 1500 1000 500 0 1 5 10 15 20 #nodes 18

Throughput NYC Taxi Ride Household Electricity The higher the better Throughput (K) 2500 2000 1500 1000 500 0 1 5 10 15 20 #nodes ~8X speedup when going from one node to 20 nodes 18

Latency 19

Latency NYC Taxi Ride Household Electricity Total processing time (seconds) 2000 1500 1000 500 0 10 20 40 60 80 90 Native Sampling fraction (%) 19

Latency NYC Taxi Ride Household Electricity The lower the better Total processing time (seconds) 2000 1500 1000 500 0 10 20 40 60 80 90 Native Sampling fraction (%) 19

Latency NYC Taxi Ride Household Electricity The lower the better Total processing time (seconds) 2000 1500 1000 500 0 10 20 40 60 80 90 Native Sampling fraction (%) 19

Latency NYC Taxi Ride Household Electricity The lower the better Total processing time (seconds) 2000 1500 1000 500 0 10 20 40 60 80 90 Native Sampling fraction (%) ~1.66X lower than the native execution with sampling fraction of 60% 19

Network overhead 20

Network overhead NYC Taxi Ride Household Electricity Network traffic (GB) 600 500 400 300 200 100 0 10 20 40 60 80 90 Native Sampling fraction (%) 20

Network overhead NYC Taxi Ride Household Electricity The lower the better Network traffic (GB) 600 500 400 300 200 100 0 10 20 40 60 80 90 Native Sampling fraction (%) 20

Network overhead NYC Taxi Ride Household Electricity The lower the better Network traffic (GB) 600 500 400 300 200 100 0 10 20 40 60 80 90 Native Sampling fraction (%) 20

Network overhead NYC Taxi Ride Household Electricity The lower the better Network traffic (GB) 600 500 400 300 200 100 0 10 20 40 60 80 90 Native Sampling fraction (%) ~1.6X lower than the native execution with sampling fraction of 60% 20

Conclusion PrivApprox: a privacy- preserving stream analytics system over distributed datasets 21

Conclusion PrivApprox: a privacy- preserving stream analytics system over distributed datasets Privacy Zero- knowledge privacy 21

Conclusion PrivApprox: a privacy- preserving stream analytics system over distributed datasets Privacy Practical Zero- knowledge privacy Adaptive execution based on query budget 21

Conclusion PrivApprox: a privacy- preserving stream analytics system over distributed datasets Privacy Practical Efficient Zero- knowledge privacy Adaptive execution based on query budget Randomized response & sampling techniques 21

Conclusion PrivApprox: a privacy- preserving stream analytics system over distributed datasets Privacy Practical Efficient Zero- knowledge privacy Adaptive execution based on query budget Randomized response & sampling techniques Thank you! https://privapprox.github.io 21