PrivApprox: Privacy-Preserving Stream Analytics
https://privapprox.github.io
Do Le Quoc, Martin Beck, Pramod Bhatotia, Ruichuan Chen, Christof Fetzer, Thorsten Strufe
July 2017

Motivation
Clients (private data) -> Analysts (recommendation, ads)
Strong privacy guarantee
High-utility analytics in real time
How to preserve users' privacy while supporting high-utility data analytics for low-latency stream processing?

State-of-the-art systems
Personal data should be stored locally under the clients' control
Analyst -> Aggregator: submit query
Aggregator -> Clients: forward query
Clients -> Aggregator: answer query
Aggregator: add noise
Aggregator -> Analyst: privacy-preserving output (differential privacy)
Limitations:
Deal with only single-shot batch queries
Require synchronization between system components
Require a trusted aggregator

PrivApprox
Clients <-> PrivApprox <-> Analyst
PrivApprox:
Supports stream processing with low latency
Enables a truly synchronization-free distributed architecture
Requires lower trust in the aggregator

Outline
Motivation
Overview
Design
Evaluation

System overview
Clients <-> PrivApprox <-> Analyst
Analyst -> PrivApprox: (Query, budget)
Execution budget:
Latency/throughput guarantees
Desired computing resources for query processing
Desired accuracy
PrivApprox -> Analyst: result
Approximate computing: low latency
Randomized response: privacy
Approximate computing + randomized response: zero-knowledge privacy
Zero-knowledge privacy > differential privacy

#1: Approximate computing
State-of-the-art systems: Compute -> Add noise -> (Privacy-preserving) approximate output
Idea: To achieve low latency, compute over a subset of data items instead of the entire dataset
Approximate computing: Take a sample -> Compute -> Approximate output ± error bound

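The sample-then-compute idea can be illustrated with a minimal sketch. It assumes simple random sampling without replacement and a normal-approximation error bound; the function name `approximate_sum` and the parameters `fraction` and `z` are hypothetical and not taken from PrivApprox itself.

```python
import math
import random


def approximate_sum(values, fraction, z=1.96, seed=0):
    """Estimate sum(values) from a random sample; return (estimate, error_bound).

    Assumption: simple random sampling without replacement, with a
    normal-approximation bound (z = 1.96 is roughly a 95% interval).
    """
    rng = random.Random(seed)
    n_total = len(values)
    n_sample = min(n_total, max(2, int(n_total * fraction)))
    sample = rng.sample(values, n_sample)

    mean = sum(sample) / n_sample
    variance = sum((x - mean) ** 2 for x in sample) / (n_sample - 1)

    # Finite population correction: the bound shrinks to 0 as the
    # sample approaches the full dataset.
    fpc = 1 - n_sample / n_total
    std_err = n_total * math.sqrt(variance / n_sample * fpc)
    return n_total * mean, z * std_err


if __name__ == "__main__":
    data = list(range(1000))
    estimate, bound = approximate_sum(data, fraction=0.1)
    print(f"approximate sum: {estimate:.0f} ± {bound:.0f} (exact: {sum(data)})")
```

A smaller `fraction` means less data to process and a wider error bound, which is the latency/accuracy trade-off the slide describes.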
#2: Randomized response
Idea: To preserve privacy, clients may not need to provide truthful answers every time
Client -> truthful answer, or a randomized "Yes" / "No"
Provides plausible deniability for clients responding to sensitive queries; achieves differential privacy (RAPPOR [CCS '14])

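A minimal sketch of randomized response, assuming the common two-coin scheme: with probability p the client answers truthfully, otherwise it answers "Yes" with probability q. The function names are hypothetical, and the estimator below is the standard inversion for this scheme rather than code from PrivApprox.

```python
import random


def randomized_answer(truth, p, q, rng):
    """Return a possibly randomized yes/no answer (1 = yes, 0 = no)."""
    if rng.random() < p:                    # first coin: answer truthfully
        return truth
    return 1 if rng.random() < q else 0     # second coin: forced yes or no


def estimate_yes(responses, p, q):
    """Estimate how many clients truly answered yes.

    Expected observed yes count = p * true_yes + (1 - p) * q * n,
    so we solve that equation for true_yes.
    """
    n = len(responses)
    observed_yes = sum(responses)
    return (observed_yes - (1 - p) * q * n) / p


if __name__ == "__main__":
    rng = random.Random(42)
    truths = [1] * 6000 + [0] * 14000       # 30% of clients truly say yes
    responses = [randomized_answer(t, 0.6, 0.6, rng) for t in truths]
    print("estimated yes count:", round(estimate_yes(responses, 0.6, 0.6)))
```

No single response reveals a client's true answer, yet the aggregate count can still be recovered approximately.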
Query model
Divide the answer's value range into buckets; enforce a binary answer in each bucket
Query: SELECT age FROM clients WHERE city = 'Santa Clara'
Buckets:  1-20 | 21-30 | 31-40 | 41-50 | 51-60 | >60
Age: 31     0  |   0   |   1   |   0   |   0   |  0
Client cannot arbitrarily manipulate answers

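The bucket encoding from this slide can be sketched as follows. The bucket boundaries come from the example above; the names `AGE_BUCKETS` and `encode_answer` are hypothetical.

```python
# Age buckets from the slide; None marks the open-ended ">60" bucket.
AGE_BUCKETS = [(1, 20), (21, 30), (31, 40), (41, 50), (51, 60), (61, None)]


def encode_answer(value, buckets=AGE_BUCKETS):
    """Encode a numeric answer as one bit per bucket (1 = value falls in it)."""
    bits = []
    for low, high in buckets:
        in_bucket = value >= low and (high is None or value <= high)
        bits.append(1 if in_bucket else 0)
    return bits


if __name__ == "__main__":
    print(encode_answer(31))  # [0, 0, 1, 0, 0, 0]
```

Because every bucket only accepts 0 or 1, a client can at most shift one count by one, which is what limits arbitrary manipulation of the result.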
Workflow: Submit query
Analyst -> Aggregator: (Query, budget)
Aggregator: Cost-Function(budget) -> system parameters:
Sampling parameter
Randomized response parameters
Aggregator -> Clients: (Query, parameters)

Workflow: Answer query
Client:
Step #1: Sampling (flip a coin to decide whether to answer the query or not)
Step #2: Randomized response
Step #3: Send randomized answer
Zero-knowledge privacy (see the paper for details!)

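The three client-side steps can be sketched together. This is an illustration under stated assumptions: the answer is a list of bits (one per bucket), each bit is randomized independently with a two-coin scheme, and the names `client_answer`, `s`, `p`, and `q` are hypothetical stand-ins for the sampling and randomized response parameters.

```python
import random


def client_answer(bits, s, p, q, rng):
    """Return the randomized bit vector to send, or None if the client sits out.

    s: probability of participating (sampling parameter)
    p: probability of reporting a bit truthfully
    q: probability of reporting 1 when not truthful
    """
    # Step 1: sampling coin decides whether this client answers at all.
    if rng.random() >= s:
        return None

    # Step 2: randomized response, applied to each bucket independently
    # (an assumption of this sketch).
    randomized = []
    for bit in bits:
        if rng.random() < p:
            randomized.append(bit)
        else:
            randomized.append(1 if rng.random() < q else 0)

    # Step 3: the caller sends this vector instead of the true answer.
    return randomized


if __name__ == "__main__":
    rng = random.Random(7)
    print(client_answer([0, 0, 1, 0, 0, 0], s=0.6, p=0.6, q=0.6, rng=rng))
```

Both steps run locally on the client, so no coordination with other clients or with the aggregator is needed to produce an answer.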
Workflow: Answer query
Clients -> Aggregator: randomized answers
Aggregator -> Analyst: approximate result ± error bound
Lack of anonymity and unlinkability?

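On the aggregator side, one way to turn randomized answers into an approximate result with an error bound is sketched below. It assumes each answer is a list of bits randomized with a two-coin scheme (truthful with probability p, otherwise 1 with probability q) and uses a rough normal-approximation bound; this is not the error estimation from the paper, and the name `aggregate` is hypothetical.

```python
import math


def aggregate(answers, population, p, q, z=1.96):
    """Estimate per-bucket yes counts for the whole population.

    answers: randomized bit vectors received from participating clients
    population: total number of clients, including those who sat out
    Returns one (estimate, error_bound) pair per bucket.
    """
    n = len(answers)
    results = []
    for bucket in range(len(answers[0])):
        observed = sum(answer[bucket] for answer in answers)
        pi_hat = observed / n

        # Invert the randomization, then scale from respondents to everyone.
        estimate = (observed - (1 - p) * q * n) / p * population / n

        # Rough bound: treat responses as independent Bernoulli(pi_hat) draws.
        std_err = math.sqrt(n * pi_hat * (1 - pi_hat)) / p * population / n
        results.append((estimate, z * std_err))
    return results


if __name__ == "__main__":
    received = [[1, 0], [1, 0], [0, 1], [1, 0]]
    for estimate, bound in aggregate(received, population=8, p=0.9, q=0.6):
        print(f"{estimate:.2f} ± {bound:.2f}")
```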
#3: Anonymity and unlinkability
Idea: XOR-based encryption
Client -> Proxies -> Aggregator
Encrypt answer M: GenerateKey -> M_K; M XOR M_K -> M_E
Decrypt answer M_E: M_E XOR M_K -> M

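The XOR steps on this slide can be sketched as follows. It assumes a fresh random key as long as the message, with the encrypted answer and the key travelling through different proxies so that neither proxy alone sees M; the function names are hypothetical, and message IDs, transport, and support for more than two proxies are left out.

```python
import secrets


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(message):
    """GenerateKey -> M_K, then M XOR M_K -> M_E. Returns (M_E, M_K)."""
    key = secrets.token_bytes(len(message))
    return xor_bytes(message, key), key


def decrypt(ciphertext, key):
    """M_E XOR M_K -> M, done once both parts have arrived."""
    return xor_bytes(ciphertext, key)


if __name__ == "__main__":
    answer = bytes([0, 0, 1, 0, 0, 0])      # a randomized bit vector
    m_e, m_k = encrypt(answer)
    # In this sketch, m_e goes through one proxy and m_k through the other.
    assert decrypt(m_e, m_k) == answer
    print("recovered:", list(decrypt(m_e, m_k)))
```

XOR is cheap compared with public-key operations, which fits the low-latency goal of the system.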
Implementation
[Architecture diagram: Clients -> Proxies -> Aggregator -> Analyst]

Experimental setup
Evaluation questions:
Utility vs. privacy
Throughput & latency
Network overhead
See the paper for more results!
Testbed:
Cluster: 44 nodes
Dataset: NYC Taxi ride records, household electricity usage

Accuracy vs privacy
[Figure: accuracy loss (%) and privacy level (ε_zk) vs. sampling fraction (%), for randomization parameters #1 (p = 0.6, q = 0.6) and #2 (p = 0.9, q = 0.6); the lower the better]
Trade-off between utility and privacy

Throughput
[Figure: throughput (K) vs. #nodes (1 to 20) for NYC Taxi Ride and Household Electricity; the higher the better]
~8X speedup when going from one node to 20 nodes

Latency
[Figure: total processing time (seconds) vs. sampling fraction (%), compared with native execution, for NYC Taxi Ride and Household Electricity; the lower the better]
~1.66X lower than the native execution with a sampling fraction of 60%

Network overhead
[Figure: network traffic (GB) vs. sampling fraction (%), compared with native execution, for NYC Taxi Ride and Household Electricity; the lower the better]
~1.6X lower than the native execution with a sampling fraction of 60%

Conclusion
PrivApprox: a privacy-preserving stream analytics system over distributed datasets
Privacy: zero-knowledge privacy
Practical: adaptive execution based on query budget
Efficient: randomized response & sampling techniques
Thank you!
https://privapprox.github.io