PrivApprox. Privacy- Preserving Stream Analytics.

PrivApprox Privacy- Preserving Stream Analytics https://privapprox.github.io Do Le Quoc, Martin Beck, Pramod Bhatotia, Ruichuan Chen, Christof Fetzer, Thorsten Strufe July 2017

Motivation Clients Analysts Private data Recommendation, Ads 1

Motivation Clients Analysts Private data Recommendation, Ads Strong privacy guarantee 1

Motivation Clients Analysts Private data Recommendation, Ads Strong privacy guarantee High utility analytics in real- time 1

Motivation Clients Analysts Private data Recommendation, Ads Strong privacy guarantee High utility analytics in real- time How to preserve users privacy while supporting high- utility data analytics for low- latency stream processing? 1

State- of- the- art systems Clients 2

State- of- the- art systems Clients Personal data should be stored locally under the clients control 2

State- of- the- art systems Clients 2

State- of- the- art systems Clients Aggregator Analyst 2

State- of- the- art systems Clients Forward query Aggregator Submit query Analyst 2

State- of- the- art systems Clients Aggregator Analyst 2

State- of- the- art systems Clients Answer query Aggregator Analyst 2

State- of- the- art systems Clients Answer query Aggregator Add noise Analyst 2

State- of- the- art systems Clients Answer query Aggregator Add noise Privacy- preserving output Analyst 2

State- of- the- art systems Clients Answer query Aggregator Add noise Privacy- preserving output Analyst Differential Privacy 2

State- of- the- art systems Clients Answer query Aggregator Add noise Privacy- preserving output Analyst Differential Privacy Limitations: 2

State- of- the- art systems Clients Answer query Aggregator Add noise Privacy- preserving output Analyst Differential Privacy Limitations: Deal with only single- shot batch queries L 2

State- of- the- art systems Clients Answer query Aggregator Add noise Privacy- preserving output Analyst Differential Privacy Limitations: Deal with only single- shot batch queries L Require synchronization between system components L 2

PrivApprox Clients PrivApprox Analyst 3

PrivApprox Clients PrivApprox Analyst PrivApprox: 3

PrivApprox Clients PrivApprox Analyst PrivApprox: Supports stream processing with low latency J 3

PrivApprox Clients PrivApprox Analyst PrivApprox: Supports stream processing with low latency J Enables a truly synchronization- free distributed architecture J 3

PrivApprox Clients PrivApprox Analyst PrivApprox: Supports stream processing with low latency J Enables a truly synchronization- free distributed architecture J Requires lower trust in aggregator J 3

Outline Motivation Overview Design Evaluation 4

System overview Clients PrivApprox Analyst 5

System overview Clients PrivApprox (Query, budget) Analyst 5

System overview Clients PrivApprox (Query, budget) Analyst Execution budget: Latency/throughput guarantees Desired computing resources for query processing Desired accuracy 5

System overview Clients PrivApprox Analyst 5

System overview Clients PrivApprox Result Analyst 5

System overview Clients PrivApprox Approximate computing Result Analyst 5

System overview Clients PrivApprox Approximate computing Result Analyst Low latency 5

System overview Clients PrivApprox Approximate computing + Randomized response Result Analyst Low latency 5

System overview Clients PrivApprox Approximate computing + Randomized response Result Analyst Low latency Privacy 5

System overview Clients PrivApprox Approximate computing + Randomized response Result Analyst 5

System overview Clients PrivApprox Approximate computing + Randomized response Result Analyst Zero- knowledge Privacy 5

System overview Clients PrivApprox Approximate computing + Randomized response Result Analyst Zero- knowledge Privacy Zero- knowledge Privacy > Differential Privacy 5

#1: Approximate computing 6

#1: Approximate computing State- of- the- art- systems Compute Add noise (Privacy- preserving) approximate output 6

#1: Approximate computing State- of- the- art- systems Compute Add noise (Privacy- preserving) approximate output Idea: To achieve low latency, compute over a sub- set of data items instead of the entire data- set 6

#2: Randomized response 7

#2: Randomized response Idea: To preserve privacy, clients may not need to provide truthful answers every time 7

#2: Randomized response Idea: To preserve privacy, clients may not need to provide truthful answers every time Client 7

#2: Randomized response Idea: To preserve privacy, clients may not need to provide truthful answers every time Client Truthful Answer 7

#2: Randomized response Idea: To preserve privacy, clients may not need to provide truthful answers every time Client No Truthful Answer Yes 7

#2: Randomized response Idea: To preserve privacy, clients may not need to provide truthful answers every time Client No Truthful Answer Yes Provides plausible deniability for clients responding to sensitive queries; achieves differential privacy (RAPPOR [CCS 14]) 7

Outline Motivation Overview Design Evaluation 8

Query model 9

Query model Divide answer s value range into buckets, enforce a binary answer in each bucket 9

Query model Divide answer s value range into buckets, enforce a binary answer in each bucket Query: SELECT age FROM clients WHERE city = Santa Clara 9

Query model Divide answer s value range into buckets, enforce a binary answer in each bucket Query: SELECT age FROM clients WHERE city = Santa Clara 1-20 21-30 31-40 41-50 51-60 >60 9

Query model Divide answer s value range into buckets, enforce a binary answer in each bucket Query: SELECT age FROM clients WHERE city = Santa Clara Age: 31 0 0 1 0 0 0 1-20 21-30 31-40 41-50 51-60 >60 9

Workflow: Submit query Aggregator (Query, budget) Analyst 10

Workflow: Submit query Clients Aggregator (Query, budget) Analyst Cost- Function(budget) System parameters: Sampling parameter Randomized response parameters 10

Workflow: Submit query Clients (Query, parameters) Aggregator (Query, budget) Analyst Cost- Function(budget) System parameters: Sampling parameter Randomized response parameters 10

Workflow: Answer query 11

Workflow: Answer query Client 11

Workflow: Answer query Client Step #1 Sampling (Flip a coin to decide to answer query or not) 11

Workflow: Answer query Client Step #1 Step #2 Sampling (Flip a coin to decide to answer query or not) Randomized Response 11

Workflow: Answer query Client Step #1 Step #2 Step #3 Sampling (Flip a coin to decide to answer query or not) Randomized Response Send randomized answer 11

Workflow: Answer query Client Step #1 Step #2 Step #3 Sampling (Flip a coin to decide to answer query or not) Randomized Response Send randomized answer Zero- knowledge privacy 11

Workflow: Answer query Client Step #1 Step #2 Step #3 Sampling (Flip a coin to decide to answer query or not) Randomized Response Send randomized answer Zero- knowledge privacy See the paper for details! 11

Workflow: Answer query Clients Randomized answers Aggregator 12

Workflow: Answer query Clients Randomized answers Aggregator Approximate result ± Error bound Analyst 12

Workflow: Answer query Clients Randomized answers Aggregator Approximate result ± Error bound Analyst Lack of anonymity and unlinkability? 12

#3: Anonymity and unlinkability 13

#3: Anonymity and unlinkability Idea: XOR- based Encryption 13

#3: Anonymity and unlinkability Idea: XOR- based Encryption Client 13

#3: Anonymity and unlinkability Idea: XOR- based Encryption Client Encrypt answer M: GenerateKey - > M k M XOR M k - > M E 13

#3: Anonymity and unlinkability Idea: XOR- based Encryption Client Proxy Aggregator Proxy Encrypt answer M: GenerateKey - > M k M XOR M k - > M E 13

#3: Anonymity and unlinkability Idea: XOR- based Encryption Client Proxy Aggregator Proxy Encrypt answer M: GenerateKey - > M k M XOR M k - > M E Decrypt answer M E : M E XOR M k - > M 13

Implementation Clients Proxy Aggregator Analyst Proxy 14

Outline Motivation Overview Design Evaluation 15

Experimental setup Evaluation questions Utility vs privacy Throughput & latency Network overhead 16

Experimental setup Evaluation questions Utility vs privacy Throughput & latency Network overhead See the paper for more results! 16

Experimental setup Evaluation questions Utility vs privacy Throughput & latency Network overhead See the paper for more results! Testbed Cluster: 44 nodes Dataset: NYC Taxi ride records, household electricity usage 16

Accuracy vs privacy 17

Accuracy vs privacy 0.6 Randomization parameters #1 (p = 0.6, q = 0.6) Randomization parameters #2 (p = 0.9, q = 0.6) 6 Accuracy loss (%) 0.5 0.4 0.3 0.2 0.1 5 4 3 2 1 Privacy (ε zk ) 0 10 20 40 60 80 90 Sampling Fraction (%) 0 17

Accuracy vs privacy 0.6 Randomization parameters #1 (p = 0.6, q = 0.6) Randomization parameters #2 (p = 0.9, q = 0.6) 6 The lower the better Accuracy loss (%) 0.5 0.4 0.3 0.2 Accuracy loss Privacy level 5 4 3 2 Privacy (ε zk ) 0.1 1 0 10 20 40 60 80 90 Sampling Fraction (%) Trade- off between utility and privacy 0 17

Throughput 18

Throughput NYC Taxi Ride Household Electricity Throughput (K) 2500 2000 1500 1000 500 0 1 5 10 15 20 #nodes 18

Throughput NYC Taxi Ride Household Electricity The higher the better Throughput (K) 2500 2000 1500 1000 500 0 1 5 10 15 20 #nodes 18

Throughput NYC Taxi Ride Household Electricity The higher the better Throughput (K) 2500 2000 1500 1000 500 0 1 5 10 15 20 #nodes ~8X speedup when going from one node to 20 nodes 18

Latency 19

Latency NYC Taxi Ride Household Electricity Total processing time (seconds) 2000 1500 1000 500 0 10 20 40 60 80 90 Native Sampling fraction (%) 19

Latency NYC Taxi Ride Household Electricity The lower the better Total processing time (seconds) 2000 1500 1000 500 0 10 20 40 60 80 90 Native Sampling fraction (%) 19

Latency NYC Taxi Ride Household Electricity The lower the better Total processing time (seconds) 2000 1500 1000 500 0 10 20 40 60 80 90 Native Sampling fraction (%) ~1.66X lower than the native execution with sampling fraction of 60% 19

Network overhead 20

Network overhead NYC Taxi Ride Household Electricity Network traffic (GB) 600 500 400 300 200 100 0 10 20 40 60 80 90 Native Sampling fraction (%) 20

Network overhead NYC Taxi Ride Household Electricity The lower the better Network traffic (GB) 600 500 400 300 200 100 0 10 20 40 60 80 90 Native Sampling fraction (%) 20

Network overhead NYC Taxi Ride Household Electricity The lower the better Network traffic (GB) 600 500 400 300 200 100 0 10 20 40 60 80 90 Native Sampling fraction (%) ~1.6X lower than the native execution with sampling fraction of 60% 20

Conclusion PrivApprox: a privacy- preserving stream analytics system over distributed datasets 21

Conclusion PrivApprox: a privacy- preserving stream analytics system over distributed datasets Privacy Zero- knowledge privacy 21

Conclusion PrivApprox: a privacy- preserving stream analytics system over distributed datasets Privacy Practical Zero- knowledge privacy Adaptive execution based on query budget 21

Conclusion PrivApprox: a privacy- preserving stream analytics system over distributed datasets Privacy Practical Efficient Zero- knowledge privacy Adaptive execution based on query budget Randomized response & sampling techniques 21