Cost-effective and Privacy-conscious MapReduce Cloud Services. Guest lecture by Dr. Balaji Palanisamy


1 Cost-effective and Privacy-conscious MapReduce Cloud Services. Guest lecture by Dr. Balaji Palanisamy

2 Computing Paradigm Shift Cloud challenges: cost-effectiveness, performance and security

3 Cloud Computing Challenges Q: How important is it that cloud service providers... (scale: 1-5; 1 = not at all important, 5 = very important)
Offer competitive pricing 92%
Offer Service Level Agreements 89%
Provide a complete solution 86%
Understand my business and industry 85%
Allow managing on-premise & off-premise... 82%
Support many of my IT needs 81%
Offer both private & public clouds 79%
Are a technology & business model... 78%
Have local presence, can come to my... 73%
(% responding 3, 4 or 5) Source: IDC Enterprise Panel, 3Q2009, n = 263

4 Cloud Computing Challenges Q: Rate the challenges/issues of the 'cloud'/on-demand model (1 = not significant, 5 = very significant)
Security 88.5%
Performance 88.1%
Availability 84.8%
Hard to integrate with in-house IT 84.5%
Not enough ability to customize 83.3%
Worried cloud will cost more 81.1%
Bringing back in-house may be difficult 80.3%
Not enough major suppliers yet 74.6%
(% responding 3, 4 or 5) Source: IDC Enterprise Panel

5 Data Growth Worldwide Corporate Data Growth (chart: Exabytes of structured vs. unstructured data). 80% of data growth: unstructured data. Source: IDC, The Digital Universe 2010

6 Big Data Processing in a Cloud MapReduce and Big Data Processing: a programming model for data-intensive computing on commodity clusters, pioneered by Google, which processes 20 PB of data per day. Scalability to large data volumes: scanning 100 TB on 1 node at 50 MB/s = 24 days; scanning on a 1000-node cluster = 35 minutes. "It is estimated that, by 2015, more than half the world's data will be processed by Hadoop" (Hortonworks). Existing dedicated MapReduce clouds: natural extensions of rental store models; performance and cost inefficiency.

7 Outline Goals: cost-effective and scalable Big Data processing using MapReduce; privacy-conscious data access and data processing in a cloud. Outline: Cost-aware Resource Management. Goal: devise resource management techniques that yield the required performance at the lowest cost. Cura: a cost-optimized model for MapReduce in a cloud. Performance-driven Resource Optimization. Goal: optimizations based on physical resource limits in a data center such as networking and storage bottlenecks. Purlieus: locality-aware resource allocation for MapReduce. Privacy-conscious MapReduce Systems. VNCache: MapReduce analysis for cloud-archived data. Airavat: security and privacy for MapReduce.

8 MapReduce - Data Parallelism x := (a * b) + (y * z); computation A, computation B. At the micro level, independent algebraic operations commute: they can be processed in any order. If commutative operations are applied to different memory addresses, then they can also occur at the same time. Compilers and CPUs often do so automatically.

9 Higher-level Parallelism x := foo(a) + bar(b); computation A, computation B. Commutativity can apply to larger operations. If foo() and bar() do not manipulate the same memory, then there is no reason why they cannot occur at the same time.

10 Parallelism: Dependency Graphs x := foo(a) + bar(b); nodes foo(a), bar(b), write x. Arrows indicate dependent operations. If foo and bar do not access the same memory, there is no dependency between them, so these operations can occur in parallel in different threads. The write x operation waits for its predecessors to complete.
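The dependency graph above can be sketched in a few lines of Python. This is an illustrative example, not from the slides: foo and bar are hypothetical pure functions that share no state, so they may run in separate threads, and the final addition plays the role of the "write x" node that waits for both predecessors.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical independent computations: neither touches shared state,
# so there is no dependency edge between them.
def foo(a):
    return a * a       # computation A

def bar(b):
    return b + 10      # computation B

with ThreadPoolExecutor(max_workers=2) as pool:
    fa = pool.submit(foo, 3)           # runs in one thread
    fb = pool.submit(bar, 4)           # runs in another thread
    x = fa.result() + fb.result()      # "write x": waits for both predecessors

print(x)  # 9 + 14 = 23
```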

11 Master/Workers One object, called the master, initially owns all data. It creates several workers to process individual elements and waits for the workers to report results back. (diagram: master delegating to worker threads)
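A minimal sketch of the master/workers pattern using the standard library (the task of squaring numbers is purely illustrative): the master owns the data, hands elements to worker threads through a queue, and collects their reported results.

```python
import threading
import queue

tasks = queue.Queue()     # master -> workers
results = queue.Queue()   # workers -> master

def worker():
    while True:
        item = tasks.get()
        if item is None:              # sentinel: no more work
            break
        results.put(item * item)      # process one element
        tasks.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for item in range(10):                # master distributes the data
    tasks.put(item)
for _ in threads:                     # one sentinel per worker
    tasks.put(None)
for t in threads:
    t.join()                          # master waits for results to be reported

total = 0
while not results.empty():
    total += results.get()
print(total)  # sum of squares 0..9 = 285
```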

12 Example: Count occurrences of each word in a large collection of documents
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");
reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
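The pseudocode above can be run end to end in plain Python by simulating the shuffle step (grouping intermediate values by key) between the map and reduce phases. This is a single-process sketch of the semantics, not of the distributed execution:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Mirrors EmitIntermediate(w, "1"): emit (word, 1) for every word.
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Mirrors the reduce: sum the per-word counts.
    return (word, sum(counts))

docs = {"d1": "the quick fox", "d2": "the lazy dog the fox"}

# Shuffle phase: group intermediate values by key.
groups = defaultdict(list)
for name, text in docs.items():
    for k, v in map_fn(name, text):
        groups[k].append(v)

counts = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(counts["the"])  # 3
```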

13 MapReduce Execution Overview (diagram) The user program forks a master, which assigns map and reduce tasks. The input data is split (Split 0, Split 1, Split 2); map workers read the splits and write intermediate results locally; reduce workers perform remote reads and sort, then write Output File 0 and Output File 1.

14 Need for a Cost-optimized Cloud Usage Model Existing per-job optimized models: per-job, customer-side greedy optimization may not be globally optimal; higher cost for customers. Cura usage model: the user submits a job, specifies the required service quality in terms of job response time, and is charged only for that service quality; the cloud provider manages the resources to ensure each job's service requirements. Other cloud-managed resource models: the Database-as-a-Service model, e.g. Relational Cloud (CIDR 2011), a cloud-managed model for resource management; cloud-managed SQL-like query services; the delayed query model in Google BigQuery, where delayed execution results in 40% lower cost.

15 User Scheduling vs. Cloud Scheduling (tables: per-job arrival time, deadline, running time and optimal number of VMs; start times and concurrent VMs under user scheduling vs. cloud scheduling)

16 Cura System Architecture

17 Static Virtual Machine Sets Cluster of physical machines: pool of small instances, pool of large instances, pool of extra-large instances

18 Static Partitioning of Virtual Machine Sets Cluster of physical machines: pool of small instances, pool of large instances, pool of extra-large instances

19 Key Challenges in Cura Design Resource provisioning and scheduling: optimal scheduling, optimal cluster configuration, optimal Hadoop configuration. Virtual machine management: optimal capacity planning; what is the right set of VMs (VM types) for the current workload? Minimize capital expenditure and operational expenses. Resource pricing: what is the price of each job based on its service quality and job characteristics?

20 VM-aware Job Scheduling Multi-bin backfilling: the scheduler needs to decide which instance type to use for the jobs. The job scheduler has two major goals: (i) complete all job executions within their deadlines, and (ii) minimize operating expense by minimizing resource usage.

21 VM-aware Scheduling Algorithm Goal: the VM-aware scheduler decides (a) when to schedule each job in the job queue, (b) which VM instance pool to use, and (c) how many VMs to use for the jobs. Minimum reservations without under-utilizing any resources. Job Ji has higher priority over job Jj if the cost of scheduling Ji is higher. For each VM pool, the scheduler picks the highest-priority job, Jprior, in the job queue and makes a reservation. Subsequently, the scheduler picks the next-highest-priority jobs in the job queue, considering priority only with respect to the reservations that are possible within the current reservation time windows of the VM pools. Runs in O(n^2) time; straightforward to obtain a distributed implementation to scale further.
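The priority rule above can be sketched as follows. This is a deliberately simplified illustration of "higher scheduling cost means higher priority" with a greedy pool choice; the job names, cost numbers, and pool names are hypothetical, and the real Cura scheduler additionally tracks deadlines and reservation time windows, which are omitted here.

```python
# Illustrative sketch only: jobs whose scheduling cost is highest are
# reserved first; each then greedily takes the pool where it runs cheapest.
jobs = [
    {"id": "J1", "cost": {"small": 8, "large": 5}},  # cost = resource usage
    {"id": "J2", "cost": {"small": 3, "large": 7}},
    {"id": "J3", "cost": {"small": 6, "large": 2}},
]

def schedule(jobs, pools=("small", "large")):
    plan = {}
    # Higher scheduling cost => higher priority (reserve it first).
    for job in sorted(jobs, key=lambda j: max(j["cost"].values()), reverse=True):
        # Pick the pool where this job incurs the lowest cost.
        pool = min(pools, key=lambda p: job["cost"][p])
        plan[job["id"]] = pool
    return plan

plan = schedule(jobs)
print(plan)  # {'J1': 'large', 'J2': 'small', 'J3': 'large'}
```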

22 Reconfiguration-aware Scheduler Assume two new jobs need a cluster of 9 small instances and 4 large instances respectively. The scheduler has the following options: 1) wait for some other clusters of small instances to complete execution, 2) run the job in an available cluster of extra-large instances, or 3) convert some large or extra-large instances into multiple small instances. (Comparison: reconfiguration-unaware scheduler vs. reconfiguration-aware scheduler.)

23 Reconfiguration Algorithm Reconfiguration time window: an observation period for collecting a history of job executions. For each job Ji, the optimal cost Copt(Ji) that incurs the lowest resource usage is found. The reconfiguration plan reflects the proportion of demands for the optimal instance types. Reconfiguration benefit: (formula not transcribed)

24 Number of Servers and Effective Utilization Fig 1: number of servers vs. deadline for a dedicated cluster, per-job clusters, and Cura. Fig 2: effective utilization vs. deadline for the same three models. Cura requires 80% fewer resources than conventional cloud models and achieves significantly higher resource utilization.

25 Talk Outline Cost-optimized cloud resource allocation: Cura, a cost-optimized model for MapReduce in a cloud. Performance-driven resource optimization: Purlieus, locality-aware resource allocation for MapReduce. Privacy-conscious processing of cloud data: VNCache, efficient MapReduce analysis for cloud-archived data; Airavat, security and privacy for MapReduce.

26 MapReduce Cloud Provider - Challenges Different loads on the shared physical infrastructure: computation load (number and size of each VM: CPU, memory); storage load (amount of input, output and intermediate data); network load (traffic generated during the map, shuffle and reduce phases). The network load is of special concern with MapReduce: large amounts of traffic can be generated in the shuffle phase. As each reduce task needs to read the output of all map tasks, a sudden explosion of network traffic can significantly deteriorate cloud performance.

27 Existing Dedicated MapReduce Clouds Amazon Elastic MapReduce model: utilize a hosted Hadoop framework running on the compute cloud (e.g. Amazon EC2) and use a separate storage service (e.g. Amazon S3). Drawbacks: adversely impacts performance as it violates data locality; unnecessary replication leading to higher costs; processing 100 TB of data using 100 VMs takes 3 hours just to load the data.

28 Purlieus: Storage Architecture A semi-persistent storage architecture: data is broken up into chunks corresponding to MapReduce blocks, stored on a distributed file system of the physical machines. Ability to transition data stored on physical machines in a seamless manner, i.e. without requiring an explicit data-loading step. The data on physical machines is seamlessly made available to VMs, using loopback mounts and VM disk attach.

29 Purlieus Architecture and Locality-optimization Goals (diagram: placement of data blocks A1..K3 across racks of physical machines)

30 Job-specific Locality-awareness Job-specific locality cost. Job categories: map-and-reduce-input heavy (Sort workload: map input size = map output size); map-input heavy (Grep workload: large map input and small map output); reduce-input heavy (synthetic dataset generators: small map input and large map output).

31 Map-and-Reduce-input Heavy Jobs Example: sorting a huge data set (output size = input size). Both map and reduce localities are important. Map phase: incorporate map locality by pushing compute towards data. Reduce phase: make communications happen within a set of closely connected nodes.

32 Scheduling Map-and-Reduce-input Heavy Tasks (map phase) (diagram: map VMs placed on the physical machines holding their input blocks)

33 Scheduling Map-and-Reduce-input Heavy Tasks (reduce phase) (diagram: reduce tasks R1..R3 placed on closely connected nodes)

34 Experimental Setup Cluster setup: a testbed of 20 CentOS 5.5 physical machines (KVM as the hypervisor). Each physical machine has 16-core 2.53 GHz Intel processors; the network is 1 Gbps. Nodes are organized in 2 racks, each containing 10 physical machines. Job types: by default, each job uses 20 VMs, with each VM configured with 4 GB memory and 4 2-GHz vCPUs.

35 Impact of Reduce-locality (timelines of running map, shuffle and reduce tasks: with only map locality vs. with both map and reduce locality; plotted using Hadoop job_history_summary)

36 Map-and-Reduce-input Heavy Workload Fig 3: execution time, and Fig 4: normalized cross-rack traffic, for locality-aware vs. random data placement under map-locality, reduce-locality, map-and-reduce-locality and no-locality VM placement. Performance depends on both map and reduce locality! Locality depends not only on compute placement but also on data placement.

37 Talk Outline Cost-optimized cloud resource allocation: Cura, a cost-optimized model for MapReduce in a cloud. Performance-driven resource optimization: Purlieus, locality-aware resource allocation for MapReduce. Privacy-conscious processing of cloud data: VNCache, efficient MapReduce analysis for cloud-archived data; Airavat, security and privacy for MapReduce.

38 MapReduce Analysis for Cloud-archived Data The compute cloud and storage cloud cannot always be collocated, e.g. when there is a privacy/security concern. Example use case: private log data archived in clouds. Logs can contain sensitive information and require the data to be encrypted before being moved to the cloud. Current solutions require the data to be transferred and processed in a secure enterprise cluster.

39 Current Analytics Solutions for Cloud-archived Data Current analytics platforms at the secure enterprise site (e.g. Hadoop/MapReduce) first download data from these public clouds, decrypt it and then process it at the secure enterprise site. Data needs to be loaded before the jobs can even start processing; this duplicates data at both the enterprise site and the storage clouds.

40 Current Analytics Solutions for Cloud-archived Data WAN copy time, decrypt time and HDFS load time are predominant factors of job execution time; actual job execution is significantly slowed down because of this overhead.

41 VNCache: Seamless Data Caching VNCache key idea: use a virtual HDFS filesystem image that provides virtual input data files on the target cluster, giving the Hadoop data blocks a virtual existence in HDFS and enabling jobs to begin execution immediately. For the HDFS filesystem running on top of VNCache, all data appears to be local even though initially the data is at the storage cloud.

42 VNCache: Data Flow Data is archived at the public storage cloud as chunks of HDFS block size. For job analysis, VNCache seamlessly streams and prefetches data from the storage cloud.

43 Overview of HDFS HDFS is the primary storage used by Hadoop applications. The HDFS Namenode manages the file system namespace and regulates access to files by clients; the individual Datanodes manage the storage attached to the nodes that they run on.

44 Interaction of HDFS with its Underlying Filesystem The Namenode at the master node stores the HDFS filesystem image, fsimage. The HDFS filesystem namespace includes the mapping of HDFS files to their constituent blocks, block identifiers, and modification and access times. A Datanode stores a set of HDFS blocks and uses separate files in the underlying filesystem for both metadata and actual blocks.

45 HDFS Filesystem Image Layout Consists of an inode structure for each HDFS file; each inode has the information for the individual data blocks corresponding to the file.

46 HDFS Virtualization VNCache filesystem: a FUSE-based filesystem used as the underlying filesystem on the Namenode and Datanodes. It is a virtual filesystem (similar to /proc on Linux) that simulates various files and directories to the HDFS layer placed on it. Namenode filesystem (fsimage) virtualization: a virtual HDFS namespace is created with a list of relevant HDFS blocks. Datanode virtualization: exposes a list of data files corresponding to the HDFS blocks placed on that Datanode.

47 VNCache's Data Flow Data flow in the VNCache model: two modes of operation, streaming and prefetching. Seamless streaming: for data block accesses on demand, the VNCache filesystem streams the data seamlessly from the storage cloud. Cache prefetching: obtains the block ahead of processing and significantly reduces job execution time.

48 Cache Prefetching Cache prefetching logic: analyzes job input data and prefetches data blocks; carefully considers the order of task processing in the input data. Prefetching order: task ordering based on decreasing order of file sizes in the job input, and on the order of blocks in the HDFS filesystem image. Distributed cache prefetching algorithm: a master prefetcher makes prefetching decisions and delegates prefetch tasks to slave prefetchers; the slave prefetchers transfer the individual data blocks from the remote cloud.

49 Caching Algorithm Rate adaptiveness: monitors the progress of the Hadoop jobs in terms of task processing rate and HDFS block request rate, and adaptively decides the set of data blocks to prefetch. Cache eviction policy: enforces a minimal-caching principle; keeps only the most immediately required blocks in the cache, minimizing the total storage footprint. Workflow awareness: workflows may have multiple jobs processing the same input dataset; VNCache recognizes this and prefetches those blocks only once, using a persistent workflow-aware cache.
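The minimal-caching principle above can be sketched as a tiny bounded cache in which a block is evicted as soon as it is consumed, so only the blocks that are about to be needed occupy space. This is an illustrative sketch only; the class name, capacity and block IDs are hypothetical and the real VNCache policy is rate-adaptive.

```python
from collections import OrderedDict

class MinimalCache:
    """Sketch of minimal caching: hold only the next few blocks a job
    will need; drop a block the moment it has been consumed."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()          # block_id -> data

    def prefetch(self, block_id, data):
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict oldest prefetched block
        self.blocks[block_id] = data

    def read(self, block_id):
        # A consumed block is evicted immediately (minimal footprint);
        # a miss would fall back to streaming from the storage cloud.
        return self.blocks.pop(block_id, None)

cache = MinimalCache(capacity=2)
cache.prefetch("b1", b"...")
cache.prefetch("b2", b"...")
cache.prefetch("b3", b"...")            # capacity reached: b1 evicted
hit_b1 = cache.read("b1")               # None: miss, stream instead
hit_b2 = cache.read("b2")               # data returned, then evicted
print(hit_b1, hit_b2)
```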

50 Experimental Setup Cluster setup: a testbed of 20 CentOS 5.5 physical machines (KVM as the hypervisor); each physical machine has 16-core 2.53 GHz Intel processors; 5 nodes are used as remote storage cloud servers. Individual job workloads: Grep, Sort, Facebook jobs. Workflow-based workloads: Facebook workflows, TF-IDF. Dataset sizes: 5 GB, 10 GB and 15 GB. Network latencies: 45 ms, 90 ms, and 150 ms, using the SWIM workload generator. By default, each job uses 5 VMs, with each VM configured with 4 GB memory and 3.5 GHz vCPUs.

51 Performance of Grep Workload Fig 5: execution time. The VNCache streaming model achieves a 42% reduction in execution time; the VNCache prefetch optimization improves this by a further 30%.

52 Performance of Facebook Workload Fig 6: execution time. Fig 7: cache hit ratio. VNCache gets more than a 55% reduction in job execution time and achieves an average cache hit ratio of 43.5%.

53 TF-IDF Workflow Fig 9: execution time. Fig 10: cache hit ratio. VNCache reduces the execution time by 47% and results in 70% cache hits on average.

54 Impact of Cache Size (Grep workload) Fig 15: execution time. Fig 16: cache hit ratio. Even with a reasonable cache size, VNCache achieves good performance.

55 Talk Outline Cost-optimized cloud resource allocation: Cura, a cost-optimized model for MapReduce in a cloud. Performance-driven resource optimization: Purlieus, locality-aware resource allocation for MapReduce. Privacy-conscious processing of cloud data: VNCache, efficient MapReduce analysis for cloud-archived data; Airavat, security and privacy for MapReduce (slides adapted from Indrajit Roy et al., NSDI 2010).

56 Computing in the Year 201X Data: the illusion of infinite resources; pay only for resources used; quickly scale up or scale down. (Indrajit Roy et al., NSDI 2010)

57 Programming Model in Year 201X Frameworks are available to ease cloud programming. MapReduce: parallel processing on clusters of machines; data flows through map and reduce to the output. Applications: data mining, genomic computation, social networks.

58 Programming Model in Year 201X Thousands of users upload their data: healthcare, shopping transactions, census, click streams. Multiple third parties mine the data for better service. Example: healthcare data. Incentive to contribute: cheaper insurance policies, new drug research, inventory control in drugstores. Fear: what if someone targets my personal data? An insurance company could find my illness and increase my premium.

59 Privacy in the Year 201X? An untrusted MapReduce program: information leak? Health data flows through the computation to the output. Applications: data mining, genomic computation, social networks.

60 Use De-identification? Achieves privacy by syntactic transformations: scrubbing, k-anonymity. Insecure against attackers with external information; privacy fiascoes: AOL search logs, Netflix dataset. Run untrusted code on the original data? How do we ensure the privacy of the users?

61 This Talk: Airavat A framework for privacy-preserving MapReduce computations with untrusted code: protected data plus an untrusted program, run inside Airavat. Airavat is the elephant of the clouds (Indian mythology).

62 Airavat Guarantee Bounded information leak* about any individual data after performing a MapReduce computation: protected data plus an untrusted program, run inside Airavat. *Differential privacy

63 Background: MapReduce map(k1, v1) -> list(k2, v2); reduce(k2, list(v2)) -> list(v2). Data 1..4 pass through the map phase and then the reduce phase to produce the output.

64 MapReduce Example
Map(input) { if (input has iPad) print(iPad, 1) }
Reduce(key, list(v)) { print(key + ", " + SUM(v)) }
Counts the number of iPads sold: inputs iPad, Tablet PC, iPad, Laptop go through the map phase, SUM runs in the reduce phase, and the output is (iPad, 2).

65 Airavat Model The Airavat framework runs on the cloud infrastructure. Cloud infrastructure: hardware + VM. Airavat: modified MapReduce + DFS + JVM + SELinux. (1) The Airavat framework is trusted.

66 Airavat Model The data provider uploads her data on Airavat and sets up certain privacy parameters. (2) Data provider; (1) the Airavat framework is trusted.

67 Airavat Model The computation provider writes the data-mining algorithm: untrusted, possibly malicious. (2) Data provider; (3) computation provider's program and output; (1) the Airavat framework is trusted.

68 Threat Model Airavat runs the computation and still protects the privacy of the data providers. Threat: the computation provider's program. (2) Data provider; (1) the Airavat framework is trusted.

69 Roadmap What is the programming model? How do we enforce privacy? What computations can be supported in Airavat?

70 Programming Model Split MapReduce into an untrusted mapper plus a trusted reducer, with a limited set of stock reducers. In a MapReduce program for data mining, the untrusted mapper needs no audit; the reducer is trusted and provided by Airavat.

71 Programming Model Need to confine the mappers! Guarantee: protect the privacy of the data providers. The untrusted mapper needs no audit; the reducer is trusted and provided by Airavat.

72 Challenge 1: Untrusted Mapper Untrusted mapper code copies data and sends it over the network. (Data from Peter, Chris and Meg flows through map and reduce.) Leaks using system resources.

73 Challenge 2: Untrusted Mapper The output of the computation is also an information channel. Example: output 1 million if Peter bought Vi*gra. (Data from Peter, Chris and Meg flows through map and reduce to the output.)

74 Airavat Mechanisms Mandatory access control: prevents leaks through storage channels like network connections or files. Differential privacy: prevents leaks through the output of the computation. Data flows through map and reduce to the output.

75 Back to the Roadmap What is the programming model? Untrusted mapper + trusted reducer. How do we enforce privacy? Leaks through system resources; leaks through the output. What computations can be supported in Airavat?

76 Airavat Confines the Untrusted Code Untrusted program: given by the computation provider. MapReduce + DFS: add mandatory access control (MAC). SELinux: add a MAC policy.

77 Airavat Confines the Untrusted Code We add mandatory access control to the MapReduce framework: label the input, intermediate values, and output; malicious code cannot leak labeled data. (Data 1..3 carry access control labels through MapReduce to the output.)

78 Airavat Confines the Untrusted Code SELinux policy to enforce MAC: creates trusted and untrusted domains; processes and files are labeled to restrict interaction; mappers reside in the untrusted domain and are denied network access, with limited file-system interaction.

79 But Access Control Is Not Enough Labels can prevent the output from being read, but when can we remove the labels? Example: if (input belongs-to Peter) print(iPad, ...). The output leaks the presence of Peter! (Inputs iPad, Tablet PC, iPad, Laptop pass through the map phase and SUM to the labeled output (iPad, ...).)

80 But Access Control Is Not Enough We need mechanisms to enforce that the output does not violate an individual's privacy.

81 Background: Differential Privacy A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not. Cynthia Dwork, Differential Privacy, ICALP.

82 Differential Privacy (Intuition) A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not. (Inputs A, B, C pass through F(x) to an output distribution.)

83 Differential Privacy (Intuition) A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not. (Inputs A, B, C and inputs A, B, C, D pass through F(x) to similar output distributions.) Bounded risk for D if she includes her data!

84 Achieving Differential Privacy A simple differentially private mechanism: given inputs x1..xn and the query "tell me f(x)", answer f(x) + noise. How much noise should one add?

85 Achieving Differential Privacy Function sensitivity (intuition): the maximum effect of any single input on the output. Aim: conceal this effect to preserve privacy. Example: computing the average height of the people in this room has low sensitivity, since any single person's height does not affect the final average by too much; calculating the maximum height has high sensitivity.

86 Achieving Differential Privacy Function sensitivity (intuition): the maximum effect of any single input on the output. Aim: conceal this effect to preserve privacy. Example: SUM over input elements drawn from [0, M] (X1 + X2 + X3 + X4): sensitivity = M, since the maximum effect of any single input element is M.

87 Achieving Differential Privacy A simple differentially private mechanism: given inputs x1..xn and the query "tell me f(x)", answer f(x) + Lap(Δ(f)). Intuition: the noise needed to mask the effect of a single input. Δ(f) = sensitivity; Lap = Laplace distribution.
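The Laplace mechanism above can be sketched with only the standard library (the random module has no Laplace sampler, so the sketch samples one via the inverse CDF). This is an illustrative sketch, not the Airavat implementation: the function names, the epsilon value and the fixed seed are all assumptions made for the example. It uses the SUM-over-[0, M] case from the slide, whose sensitivity is M.

```python
import math
import random

def laplace_noise(scale, rng=random.Random(42)):
    # Inverse-CDF sample of Laplace(0, scale); seeded only to make
    # this illustration reproducible.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_sum(values, upper_bound, epsilon):
    # SUM over inputs in [0, M] has sensitivity M,
    # so add Lap(M / epsilon) noise to the true sum.
    return sum(values) + laplace_noise(upper_bound / epsilon)

values = [0.2, 0.7, 0.4, 0.9]                 # inputs drawn from [0, 1]
noisy = private_sum(values, upper_bound=1.0, epsilon=0.5)
print(noisy)  # true sum 2.2 plus Laplace noise
```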

88 Back to the Roadmap What is the programming model? Untrusted mapper + trusted reducer. How do we enforce privacy? Leaks through system resources: MAC. Leaks through the output. What computations can be supported in Airavat?

89 Enforcing Differential Privacy The mapper can be any piece of Java code (a "black box"), but the range of mapper outputs must be declared in advance. The range is used to estimate sensitivity (how much does a single input influence the output?) and determines how much noise is added to the outputs to ensure differential privacy. Example: for a mapper range [0, M], SUM has an estimated sensitivity of M.

90 Enforcing Differential Privacy Malicious mappers may output values outside the declared range. If a mapper produces such a value, it is replaced by a value inside the range; the user is not notified, since notification would itself be a possible information leak. A range enforcer sits between the mappers and the reducer, ensuring that the code is not more sensitive than declared; noise is then added at the reducer.
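The range-enforcer idea can be sketched as a clamp over the mapper's key-value stream. This is an illustrative sketch only (function name, declared range and sample outputs are assumptions): values outside the declared range are silently replaced by in-range values, so a malicious output cannot inflate the effective sensitivity, and the mapper is not told.

```python
def enforce_range(mapper_outputs, lo, hi):
    # Silently clamp each mapper value into the declared range [lo, hi];
    # the mapper is NOT notified (that would itself leak a bit).
    for key, value in mapper_outputs:
        yield key, min(max(value, lo), hi)

declared = (0, 1)  # declared mapper range, as in the [0, M] example with M = 1
outs = [("iPad", 1), ("iPad", 1), ("iPad", 1_000_000)]  # last value malicious
clamped = list(enforce_range(outs, *declared))
print(sum(v for _, v in clamped))  # 3, not 1000002
```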

91 What Can We Compute? Reducers are responsible for enforcing privacy: they add an appropriate amount of random noise to the outputs, so reducers must be trusted. Sample reducers: SUM, COUNT, THRESHOLD; sufficient to perform data-mining algorithms, search-log processing, recommender systems, etc. With trusted mappers, more general computations are possible, using exact sensitivity instead of range-based estimates.

92 Sample Computations Many queries can be done with untrusted mappers. SUM: how many iPads were sold today? Mean: what is the average score of male students at UT? Threshold: output the frequency of security books that sold more than 25 copies today. Others require trusted mapper code, e.g. "list all items and their quantity sold": a malicious mapper can encode information in item names.

93 Revisiting Airavat Guarantees Allows differentially private MapReduce computations even when the code is untrusted. Differential privacy => a mathematical bound on the information leak. What is a safe bound on the information leak? It depends on the context and the dataset; not our problem.

94 Implementation Details SELinux policy: domains for trusted and untrusted programs; apply restrictions on each domain (500 LoC). MapReduce: modifications to support mandatory access control; set of trusted reducers (450 LoC). JVM: modifications to enforce mapper independence (5000 LoC). LoC = lines of code.

95 Airavat in Brief Airavat is a framework for privacy-preserving MapReduce computations that confines untrusted code: the first to integrate mandatory access control with differential privacy for end-to-end enforcement. Protected data plus an untrusted program, run inside Airavat.

96 References
Balaji Palanisamy, Aameek Singh, Ling Liu and Bryan Langston, "Cura: A Cost-optimized Model for MapReduce in a Cloud", in IPDPS '13.
Balaji Palanisamy, Aameek Singh, Ling Liu and Bhushan Jain, "Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud", in Proc. of IEEE/ACM Supercomputing (SC '11).
Balaji Palanisamy, Aameek Singh, Nagapramod Mandagere, Gabriel Alatorre and Ling Liu, "VNCache: Bridging Compute and Storage for Big Data in a Cloud", in CCGrid.
Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov and Emmett Witchel, "Airavat: Security and Privacy for MapReduce", in NSDI 2010.
Balaji Palanisamy, Aameek Singh, Ling Liu and Bryan Langston, "Cura: A Cost-optimized Model for MapReduce in a Cloud", IEEE Transactions on Parallel and Distributed Systems (TPDS 2014).


More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Distributed Computations MapReduce. adapted from Jeff Dean s slides Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece

More information

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create

More information

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017 Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google

More information

CSE 344 MAY 2 ND MAP/REDUCE

CSE 344 MAY 2 ND MAP/REDUCE CSE 344 MAY 2 ND MAP/REDUCE ADMINISTRIVIA HW5 Due Tonight Practice midterm Section tomorrow Exam review PERFORMANCE METRICS FOR PARALLEL DBMSS Nodes = processors, computers Speedup: More nodes, same data

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

MATE-EC2: A Middleware for Processing Data with Amazon Web Services MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Dynamic Cluster Configuration Algorithm in MapReduce Cloud

Dynamic Cluster Configuration Algorithm in MapReduce Cloud Dynamic Cluster Configuration Algorithm in MapReduce Cloud Rahul Prasad Kanu, Shabeera T P, S D Madhu Kumar Computer Science and Engineering Department, National Institute of Technology Calicut Calicut,

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014 CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

MapReduce. U of Toronto, 2014

MapReduce. U of Toronto, 2014 MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins MapReduce 1 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins 2 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce

More information

BigData and Map Reduce VITMAC03

BigData and Map Reduce VITMAC03 BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to

More information

MapReduce. Stony Brook University CSE545, Fall 2016

MapReduce. Stony Brook University CSE545, Fall 2016 MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining

More information

2013 AWS Worldwide Public Sector Summit Washington, D.C.

2013 AWS Worldwide Public Sector Summit Washington, D.C. 2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Distributed Systems CS6421

Distributed Systems CS6421 Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool

More information

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:

More information

MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat OSDI 2004 Presented by Zachary Bischof Winter '10 EECS 345 Distributed Systems 1 Motivation Summary Example Implementation

More information

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing : Exploring Memory Locality for Big Data Analytics in Virtualized Clusters Eunji Hwang, Hyungoo Kim, Beomseok Nam and Young-ri

More information

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014 Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

MapReduce-style data processing

MapReduce-style data processing MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic

More information

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands Unleash Your Data Center s Hidden Power September 16, 2014 Molly Rector CMO, EVP Product Management & WW Marketing

More information

Batch Processing Basic architecture

Batch Processing Basic architecture Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3

More information

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BRETT WENINGER, MANAGING DIRECTOR 10/21/2014 ADURANT APPROACH TO BIG DATA Align to Un/Semi-structured Data Instead of Big Scale out will become Big Greatest

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

Introduction to Map Reduce

Introduction to Map Reduce Introduction to Map Reduce 1 Map Reduce: Motivation We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Distributed Systems 16. Distributed File Systems II

Distributed Systems 16. Distributed File Systems II Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS

More information

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Google File System (GFS) and Hadoop Distributed File System (HDFS) Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

MapReduce and Hadoop

MapReduce and Hadoop Università degli Studi di Roma Tor Vergata MapReduce and Hadoop Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference Big Data stack High-level Interfaces Data Processing

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Practice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce

Practice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce Practice and Applications of Data Management CMPSCI 345 Lecture 18: Big Data, Hadoop, and MapReduce Why Big Data, Hadoop, M-R? } What is the connec,on with the things we learned? } What about SQL? } What

More information

Part A: MapReduce. Introduction Model Implementation issues

Part A: MapReduce. Introduction Model Implementation issues Part A: Massive Parallelism li with MapReduce Introduction Model Implementation issues Acknowledgements Map-Reduce The material is largely based on material from the Stanford cources CS246, CS345A and

More information

Introduction to MapReduce (cont.)

Introduction to MapReduce (cont.) Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations

More information

Enhanced Hadoop with Search and MapReduce Concurrency Optimization

Enhanced Hadoop with Search and MapReduce Concurrency Optimization Volume 114 No. 12 2017, 323-331 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Enhanced Hadoop with Search and MapReduce Concurrency Optimization

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Hedvig as backup target for Veeam

Hedvig as backup target for Veeam Hedvig as backup target for Veeam Solution Whitepaper Version 1.0 April 2018 Table of contents Executive overview... 3 Introduction... 3 Solution components... 4 Hedvig... 4 Hedvig Virtual Disk (vdisk)...

More information

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core

More information

Hadoop and HDFS Overview. Madhu Ankam

Hadoop and HDFS Overview. Madhu Ankam Hadoop and HDFS Overview Madhu Ankam Why Hadoop We are gathering more data than ever Examples of data : Server logs Web logs Financial transactions Analytics Emails and text messages Social media like

More information

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Journal homepage: www.mjret.in ISSN:2348-6953 A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Bhavsar Nikhil, Bhavsar Riddhikesh,Patil Balu,Tad Mukesh Department of Computer Engineering JSPM s

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

EsgynDB Enterprise 2.0 Platform Reference Architecture

EsgynDB Enterprise 2.0 Platform Reference Architecture EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

CS60021: Scalable Data Mining. Sourangshu Bhattacharya

CS60021: Scalable Data Mining. Sourangshu Bhattacharya CS60021: Scalable Data Mining Sourangshu Bhattacharya In this Lecture: Outline: HDFS Motivation HDFS User commands HDFS System architecture HDFS Implementation details Sourangshu Bhattacharya Computer

More information

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight ESG Lab Review InterSystems Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Hadoop Distributed File System(HDFS)

Hadoop Distributed File System(HDFS) Hadoop Distributed File System(HDFS) Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul

More information

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016 15-319 / 15-619 Cloud Computing Recitation 3 Sep 13 & 15, 2016 1 Overview Administrative Issues Last Week s Reflection Project 1.1, OLI Unit 1, Quiz 1 This Week s Schedule Project1.2, OLI Unit 2, Module

More information

The Google File System. Alexandru Costan

The Google File System. Alexandru Costan 1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems

More information

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568 FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

IntegrityMR: Integrity Assurance Framework for Big Data Analytics and Management

IntegrityMR: Integrity Assurance Framework for Big Data Analytics and Management IntegrityMR: Integrity Assurance Framework for Big Data Analytics and Management Applications Yongzhi Wang, Jinpeng Wei Florida International University Mudhakar Srivatsa IBM T.J. Watson Research Center

More information