Cost-effective and Privacy-conscious MapReduce Cloud Services. Guest lecture by Dr. Balaji Palanisamy


1 Cost-effective and Privacy-conscious MapReduce Cloud Services. Guest lecture by Dr. Balaji Palanisamy

2 Computing Paradigm Shift Cloud challenges: cost-effectiveness, performance and security

3 Cloud Computing Challenges Q: How important is it that cloud service providers... (scale: 1-5; 1 = not at all important, 5 = very important)
Offer competitive pricing 92%
Offer Service Level Agreements 89%
Provide a complete solution 86%
Understand my business and industry 85%
Allow managing on-premise & off-premise... 82%
Support many of my IT needs 81%
Offer both private & public clouds 79%
Are a technology & business model... 78%
Have local presence, can come to my... 73%
(% responding 3, 4 or 5) Source: IDC Enterprise Panel, 3Q2009, n = 263

4 Cloud Computing Challenges Q: Rate the challenges/issues of the 'cloud'/on-demand model (1 = not significant, 5 = very significant)
Security 88.5%
Performance 88.1%
Availability 84.8%
Hard to integrate with in-house IT 84.5%
Not enough ability to customize 83.3%
Worried cloud will cost more 81.1%
Bringing back in-house may be difficult 80.3%
Not enough major suppliers yet 74.6%
(% responding 3, 4 or 5) Source: IDC Enterprise Panel

5 Data Growth Worldwide Corporate Data Growth (chart: Exabytes of structured vs. unstructured data). 80% of data growth: unstructured data. Source: IDC, The Digital Universe 2010

6 Big Data Processing in a Cloud MapReduce and Big Data Processing: a programming model for data-intensive computing on commodity clusters, pioneered by Google, which processes 20 PB of data per day. Scalability to large data volumes: scanning 100 TB on 1 node at 50 MB/s = 24 days; scanning on a 1000-node cluster = 35 minutes. "It is estimated that, by 2015, more than half the world's data will be processed by Hadoop" (Hortonworks). Existing dedicated MapReduce clouds: natural extensions of rental store models; performance and cost inefficiency.

7 Outline Goals: cost-effective and scalable Big Data processing using MapReduce; privacy-conscious data access and data processing in a cloud. Outline: Cost-aware Resource Management. Goal: devise resource management techniques that yield the required performance at the lowest cost. Cura: a cost-optimized model for MapReduce in a cloud. Performance-driven Resource Optimization. Goal: optimizations based on physical resource limits in a data center such as networking and storage bottlenecks. Purlieus: locality-aware resource allocation for MapReduce. Privacy-conscious MapReduce Systems. VNCache: MapReduce analysis for cloud-archived data. Airavat: security and privacy for MapReduce.

8 MapReduce - Data Parallelism x := (a * b) + (y * z); computation A, computation B. At the micro level, independent algebraic operations commute: they can be processed in any order. If commutative operations are applied to different memory addresses, then they can also occur at the same time. Compilers and CPUs often do so automatically.

9 Higher-level Parallelism x := foo(a) + bar(b); computation A, computation B. Commutativity can apply to larger operations. If foo() and bar() do not manipulate the same memory, then there is no reason why they cannot occur at the same time.

10 Parallelism: Dependency Graphs x := foo(a) + bar(b); nodes foo(a), bar(b), write x. Arrows indicate dependent operations. If foo and bar do not access the same memory, there is no dependency between them, so these operations can occur in parallel in different threads. The write x operation waits for its predecessors to complete.
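The dependency graph above can be sketched in a few lines of Python. This is an illustrative example, not from the slides: foo and bar are hypothetical pure functions that share no state, so they may run in separate threads, and the final addition plays the role of the "write x" node that waits for both predecessors.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical independent computations: neither touches shared state,
# so there is no dependency edge between them.
def foo(a):
    return a * a       # computation A

def bar(b):
    return b + 10      # computation B

with ThreadPoolExecutor(max_workers=2) as pool:
    fa = pool.submit(foo, 3)           # runs in one thread
    fb = pool.submit(bar, 4)           # runs in another thread
    x = fa.result() + fb.result()      # "write x": waits for both predecessors

print(x)  # 9 + 14 = 23
```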

11 Master/Workers One object, called the master, initially owns all data. It creates several workers to process individual elements and waits for the workers to report results back. (diagram: master delegating to worker threads)
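A minimal sketch of the master/workers pattern using the standard library (the task of squaring numbers is purely illustrative): the master owns the data, hands elements to worker threads through a queue, and collects their reported results.

```python
import threading
import queue

tasks = queue.Queue()     # master -> workers
results = queue.Queue()   # workers -> master

def worker():
    while True:
        item = tasks.get()
        if item is None:              # sentinel: no more work
            break
        results.put(item * item)      # process one element
        tasks.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for item in range(10):                # master distributes the data
    tasks.put(item)
for _ in threads:                     # one sentinel per worker
    tasks.put(None)
for t in threads:
    t.join()                          # master waits for results to be reported

total = 0
while not results.empty():
    total += results.get()
print(total)  # sum of squares 0..9 = 285
```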

12 Example: Count occurrences of each word in a large collection of documents
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");
reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
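The pseudocode above can be run end to end in plain Python by simulating the shuffle step (grouping intermediate values by key) between the map and reduce phases. This is a single-process sketch of the semantics, not of the distributed execution:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Mirrors EmitIntermediate(w, "1"): emit (word, 1) for every word.
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Mirrors the reduce: sum the per-word counts.
    return (word, sum(counts))

docs = {"d1": "the quick fox", "d2": "the lazy dog the fox"}

# Shuffle phase: group intermediate values by key.
groups = defaultdict(list)
for name, text in docs.items():
    for k, v in map_fn(name, text):
        groups[k].append(v)

counts = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(counts["the"])  # 3
```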

13 MapReduce Execution Overview (diagram) The user program forks a master, which assigns map and reduce tasks. The input data is split (Split 0, Split 1, Split 2); map workers read the splits and write intermediate results locally; reduce workers perform remote reads and sort, then write Output File 0 and Output File 1.

14 Need for a Cost-optimized Cloud Usage Model Existing per-job optimized models: per-job, customer-side greedy optimization may not be globally optimal; higher cost for customers. Cura usage model: the user submits a job, specifies the required service quality in terms of job response time, and is charged only for that service quality; the cloud provider manages the resources to ensure each job's service requirements. Other cloud-managed resource models: the Database-as-a-Service model, e.g. Relational Cloud (CIDR 2011), a cloud-managed model for resource management; cloud-managed SQL-like query services; the delayed query model in Google BigQuery, where delayed execution results in 40% lower cost.

15 User Scheduling vs. Cloud Scheduling (tables: per-job arrival time, deadline, running time and optimal number of VMs; start times and concurrent VMs under user scheduling vs. cloud scheduling)

16 Cura System Architecture

17 Static Virtual Machine Sets Cluster of physical machines: pool of small instances, pool of large instances, pool of extra-large instances

18 Static Partitioning of Virtual Machine Sets Cluster of physical machines: pool of small instances, pool of large instances, pool of extra-large instances

19 Key Challenges in Cura Design Resource provisioning and scheduling: optimal scheduling, optimal cluster configuration, optimal Hadoop configuration. Virtual machine management: optimal capacity planning; what is the right set of VMs (VM types) for the current workload? Minimize capital expenditure and operational expenses. Resource pricing: what is the price of each job based on its service quality and job characteristics?

20 VM-aware Job Scheduling Multi-bin backfilling: the scheduler needs to decide which instance type to use for the jobs. The job scheduler has two major goals: (i) complete all job executions within their deadlines, and (ii) minimize operating expense by minimizing resource usage.

21 VM-aware Scheduling Algorithm Goal: the VM-aware scheduler decides (a) when to schedule each job in the job queue, (b) which VM instance pool to use, and (c) how many VMs to use for the jobs. Minimum reservations without under-utilizing any resources. Job Ji has higher priority over job Jj if the cost of scheduling Ji is higher. For each VM pool, the scheduler picks the highest-priority job, Jprior, in the job queue and makes a reservation. Subsequently, the scheduler picks the next-highest-priority jobs in the job queue, considering priority only with respect to the reservations that are possible within the current reservation time windows of the VM pools. Runs in O(n^2) time; straightforward to obtain a distributed implementation to scale further.
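The priority rule above can be sketched as follows. This is a deliberately simplified illustration of "higher scheduling cost means higher priority" with a greedy pool choice; the job names, cost numbers, and pool names are hypothetical, and the real Cura scheduler additionally tracks deadlines and reservation time windows, which are omitted here.

```python
# Illustrative sketch only: jobs whose scheduling cost is highest are
# reserved first; each then greedily takes the pool where it runs cheapest.
jobs = [
    {"id": "J1", "cost": {"small": 8, "large": 5}},  # cost = resource usage
    {"id": "J2", "cost": {"small": 3, "large": 7}},
    {"id": "J3", "cost": {"small": 6, "large": 2}},
]

def schedule(jobs, pools=("small", "large")):
    plan = {}
    # Higher scheduling cost => higher priority (reserve it first).
    for job in sorted(jobs, key=lambda j: max(j["cost"].values()), reverse=True):
        # Pick the pool where this job incurs the lowest cost.
        pool = min(pools, key=lambda p: job["cost"][p])
        plan[job["id"]] = pool
    return plan

plan = schedule(jobs)
print(plan)  # {'J1': 'large', 'J2': 'small', 'J3': 'large'}
```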

22 Reconfiguration-aware Scheduler Assume two new jobs need a cluster of 9 small instances and 4 large instances respectively. The scheduler has the following options: 1) wait for some other clusters of small instances to complete execution, 2) run the job in an available cluster of extra-large instances, or 3) convert some large or extra-large instances into multiple small instances. (Comparison: reconfiguration-unaware scheduler vs. reconfiguration-aware scheduler.)

23 Reconfiguration Algorithm Reconfiguration time window: an observation period for collecting a history of job executions. For each job Ji, the optimal cost Copt(Ji) that incurs the lowest resource usage is found. The reconfiguration plan reflects the proportion of demands for the optimal instance types. Reconfiguration benefit: (formula not transcribed)

24 Number of Servers and Effective Utilization Fig 1: number of servers vs. deadline for a dedicated cluster, per-job clusters, and Cura. Fig 2: effective utilization vs. deadline for the same three models. Cura requires 80% fewer resources than conventional cloud models and achieves significantly higher resource utilization.

25 Talk Outline Cost-optimized cloud resource allocation: Cura, a cost-optimized model for MapReduce in a cloud. Performance-driven resource optimization: Purlieus, locality-aware resource allocation for MapReduce. Privacy-conscious processing of cloud data: VNCache, efficient MapReduce analysis for cloud-archived data; Airavat, security and privacy for MapReduce.

26 MapReduce Cloud Provider - Challenges Different loads on the shared physical infrastructure: computation load (number and size of each VM: CPU, memory); storage load (amount of input, output and intermediate data); network load (traffic generated during the map, shuffle and reduce phases). The network load is of special concern with MapReduce: large amounts of traffic can be generated in the shuffle phase. As each reduce task needs to read the output of all map tasks, a sudden explosion of network traffic can significantly deteriorate cloud performance.

27 Existing Dedicated MapReduce Clouds Amazon Elastic MapReduce model: utilize a hosted Hadoop framework running on the compute cloud (e.g. Amazon EC2) and use a separate storage service (e.g. Amazon S3). Drawbacks: adversely impacts performance as it violates data locality; unnecessary replication leading to higher costs; processing 100 TB of data using 100 VMs takes 3 hours just to load the data.

28 Purlieus: Storage Architecture A semi-persistent storage architecture: data is broken up into chunks corresponding to MapReduce blocks, stored on a distributed file system of the physical machines. Ability to transition data stored on physical machines in a seamless manner, i.e. without requiring an explicit data-loading step. The data on physical machines is seamlessly made available to VMs, using loopback mounts and VM disk attach.

29 Purlieus Architecture and Locality-optimization Goals (diagram: placement of data blocks A1..K3 across racks of physical machines)

30 Job-specific Locality-awareness Job-specific locality cost. Job categories: map-and-reduce-input heavy (Sort workload: map input size = map output size); map-input heavy (Grep workload: large map input and small map output); reduce-input heavy (synthetic dataset generators: small map input and large map output).

31 Map-and-Reduce-input Heavy Jobs Example: sorting a huge data set (output size = input size). Both map and reduce localities are important. Map phase: incorporate map locality by pushing compute towards data. Reduce phase: make communications happen within a set of closely connected nodes.

32 Scheduling Map-and-Reduce-input Heavy Tasks (map phase) (diagram: map VMs placed on the physical machines holding their input blocks)

33 Scheduling Map-and-Reduce-input Heavy Tasks (reduce phase) (diagram: reduce tasks R1..R3 placed on closely connected nodes)

34 Experimental Setup Cluster setup: a testbed of 20 CentOS 5.5 physical machines (KVM as the hypervisor). Each physical machine has 16-core 2.53 GHz Intel processors; the network is 1 Gbps. Nodes are organized in 2 racks, each containing 10 physical machines. Job types: by default, each job uses 20 VMs, with each VM configured with 4 GB memory and 4 2-GHz vCPUs.

35 Impact of Reduce-locality (timelines of running map, shuffle and reduce tasks: with only map locality vs. with both map and reduce locality; plotted using Hadoop job_history_summary)

36 Map-and-Reduce-input Heavy Workload Fig 3: execution time, and Fig 4: normalized cross-rack traffic, for locality-aware vs. random data placement under map-locality, reduce-locality, map-and-reduce-locality and no-locality VM placement. Performance depends on both map and reduce locality! Locality depends not only on compute placement but also on data placement.

37 Talk Outline Cost-optimized cloud resource allocation: Cura, a cost-optimized model for MapReduce in a cloud. Performance-driven resource optimization: Purlieus, locality-aware resource allocation for MapReduce. Privacy-conscious processing of cloud data: VNCache, efficient MapReduce analysis for cloud-archived data; Airavat, security and privacy for MapReduce.

38 MapReduce Analysis for Cloud-archived Data The compute cloud and storage cloud cannot always be collocated, e.g. when there is a privacy/security concern. Example use case: private log data archived in clouds. Logs can contain sensitive information and require the data to be encrypted before being moved to the cloud. Current solutions require the data to be transferred and processed in a secure enterprise cluster.

39 Current Analytics Solutions for Cloud-archived Data Current analytics platforms at the secure enterprise site (e.g. Hadoop/MapReduce) first download data from these public clouds, decrypt it and then process it at the secure enterprise site. Data needs to be loaded before the jobs can even start processing; this duplicates data at both the enterprise site and the storage clouds.

40 Current Analytics Solutions for Cloud-archived Data WAN copy time, decrypt time and HDFS load time are predominant factors of job execution time; actual job execution is significantly slowed down because of this overhead.

41 VNCache: Seamless Data Caching VNCache key idea: use a virtual HDFS filesystem image that provides virtual input data files on the target cluster, giving the Hadoop data blocks a virtual existence in HDFS and enabling jobs to begin execution immediately. For the HDFS filesystem running on top of VNCache, all data appears to be local even though initially the data is at the storage cloud.

42 VNCache: Data Flow Data is archived at the public storage cloud as chunks of HDFS block size. For job analysis, VNCache seamlessly streams and prefetches data from the storage cloud.

43 Overview of HDFS HDFS is the primary storage used by Hadoop applications. The HDFS Namenode manages the file system namespace and regulates access to files by clients; the individual Datanodes manage the storage attached to the nodes that they run on.

44 Interaction of HDFS with its Underlying Filesystem The Namenode at the master node stores the HDFS filesystem image, fsimage. The HDFS filesystem namespace includes the mapping of HDFS files to their constituent blocks, block identifiers, and modification and access times. A Datanode stores a set of HDFS blocks and uses separate files in the underlying filesystem for both metadata and actual blocks.

45 HDFS Filesystem Image Layout Consists of an inode structure for each HDFS file; each inode has the information for the individual data blocks corresponding to the file.

46 HDFS Virtualization VNCache filesystem: a FUSE-based filesystem used as the underlying filesystem on the Namenode and Datanodes. It is a virtual filesystem (similar to /proc on Linux) that simulates various files and directories to the HDFS layer placed on it. Namenode filesystem (fsimage) virtualization: a virtual HDFS namespace is created with a list of relevant HDFS blocks. Datanode virtualization: exposes a list of data files corresponding to the HDFS blocks placed on that Datanode.

47 VNCache's Data Flow Data flow in the VNCache model: two modes of operation, streaming and prefetching. Seamless streaming: for data block accesses on demand, the VNCache filesystem streams the data seamlessly from the storage cloud. Cache prefetching: obtains the block ahead of processing and significantly reduces job execution time.

48 Cache Prefetching Cache prefetching logic: analyzes job input data and prefetches data blocks; carefully considers the order of task processing in the input data. Prefetching order: task ordering based on decreasing order of file sizes in the job input, and on the order of blocks in the HDFS filesystem image. Distributed cache prefetching algorithm: a master prefetcher makes prefetching decisions and delegates prefetch tasks to slave prefetchers; the slave prefetchers transfer the individual data blocks from the remote cloud.

49 Caching Algorithm Rate adaptiveness: monitors the progress of the Hadoop jobs in terms of task processing rate and HDFS block request rate, and adaptively decides the set of data blocks to prefetch. Cache eviction policy: enforces a minimal-caching principle; keeps only the most immediately required blocks in the cache, minimizing the total storage footprint. Workflow awareness: workflows may have multiple jobs processing the same input dataset; VNCache recognizes this and prefetches those blocks only once, using a persistent workflow-aware cache.
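The minimal-caching principle above can be sketched as a tiny bounded cache in which a block is evicted as soon as it is consumed, so only the blocks that are about to be needed occupy space. This is an illustrative sketch only; the class name, capacity and block IDs are hypothetical and the real VNCache policy is rate-adaptive.

```python
from collections import OrderedDict

class MinimalCache:
    """Sketch of minimal caching: hold only the next few blocks a job
    will need; drop a block the moment it has been consumed."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()          # block_id -> data

    def prefetch(self, block_id, data):
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict oldest prefetched block
        self.blocks[block_id] = data

    def read(self, block_id):
        # A consumed block is evicted immediately (minimal footprint);
        # a miss would fall back to streaming from the storage cloud.
        return self.blocks.pop(block_id, None)

cache = MinimalCache(capacity=2)
cache.prefetch("b1", b"...")
cache.prefetch("b2", b"...")
cache.prefetch("b3", b"...")            # capacity reached: b1 evicted
hit_b1 = cache.read("b1")               # None: miss, stream instead
hit_b2 = cache.read("b2")               # data returned, then evicted
print(hit_b1, hit_b2)
```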

50 Experimental Setup Cluster setup: a testbed of 20 CentOS 5.5 physical machines (KVM as the hypervisor); each physical machine has 16-core 2.53 GHz Intel processors; 5 nodes are used as remote storage cloud servers. Individual job workloads: Grep, Sort, Facebook jobs. Workflow-based workloads: Facebook workflows, TF-IDF. Dataset sizes: 5 GB, 10 GB and 15 GB. Network latencies: 45 ms, 90 ms, and 150 ms, using the SWIM workload generator. By default, each job uses 5 VMs, with each VM configured with 4 GB memory and 3.5 GHz vCPUs.

51 Performance of Grep Workload Fig 5: execution time. The VNCache streaming model achieves a 42% reduction in execution time; the VNCache prefetch optimization improves this by a further 30%.

52 Performance of Facebook Workload Fig 6: execution time. Fig 7: cache hit ratio. VNCache gets more than a 55% reduction in job execution time and achieves an average cache hit ratio of 43.5%.

53 TF-IDF Workflow Fig 9: execution time. Fig 10: cache hit ratio. VNCache reduces the execution time by 47% and results in 70% cache hits on average.

54 Impact of Cache Size (Grep workload) Fig 15: execution time. Fig 16: cache hit ratio. Even with a reasonable cache size, VNCache achieves good performance.

55 Talk Outline Cost-optimized cloud resource allocation: Cura, a cost-optimized model for MapReduce in a cloud. Performance-driven resource optimization: Purlieus, locality-aware resource allocation for MapReduce. Privacy-conscious processing of cloud data: VNCache, efficient MapReduce analysis for cloud-archived data; Airavat, security and privacy for MapReduce (slides adapted from Indrajit Roy et al., NSDI 2010).

56 Computing in the Year 201X Data: the illusion of infinite resources; pay only for resources used; quickly scale up or scale down. (Indrajit Roy et al., NSDI 2010)

57 Programming Model in Year 201X Frameworks are available to ease cloud programming. MapReduce: parallel processing on clusters of machines; data flows through map and reduce to the output. Applications: data mining, genomic computation, social networks.

58 Programming Model in Year 201X Thousands of users upload their data: healthcare, shopping transactions, census, click streams. Multiple third parties mine the data for better service. Example: healthcare data. Incentive to contribute: cheaper insurance policies, new drug research, inventory control in drugstores. Fear: what if someone targets my personal data? An insurance company could find my illness and increase my premium.

59 Privacy in the Year 201X? An untrusted MapReduce program: information leak? Health data flows through the computation to the output. Applications: data mining, genomic computation, social networks.

60 Use De-identification? Achieves privacy by syntactic transformations: scrubbing, k-anonymity. Insecure against attackers with external information; privacy fiascoes: AOL search logs, Netflix dataset. Run untrusted code on the original data? How do we ensure the privacy of the users?

61 This Talk: Airavat A framework for privacy-preserving MapReduce computations with untrusted code: protected data plus an untrusted program, run inside Airavat. Airavat is the elephant of the clouds (Indian mythology).

62 Airavat Guarantee Bounded information leak* about any individual data after performing a MapReduce computation: protected data plus an untrusted program, run inside Airavat. *Differential privacy

63 Background: MapReduce map(k1, v1) -> list(k2, v2); reduce(k2, list(v2)) -> list(v2). Data 1..4 pass through the map phase and then the reduce phase to produce the output.

64 MapReduce Example
Map(input) { if (input has iPad) print(iPad, 1) }
Reduce(key, list(v)) { print(key + ", " + SUM(v)) }
Counts the number of iPads sold: inputs iPad, Tablet PC, iPad, Laptop go through the map phase, SUM runs in the reduce phase, and the output is (iPad, 2).

65 Airavat Model The Airavat framework runs on the cloud infrastructure. Cloud infrastructure: hardware + VM. Airavat: modified MapReduce + DFS + JVM + SELinux. (1) The Airavat framework is trusted.

66 Airavat Model The data provider uploads her data on Airavat and sets up certain privacy parameters. (2) Data provider; (1) the Airavat framework is trusted.

67 Airavat Model The computation provider writes the data-mining algorithm: untrusted, possibly malicious. (2) Data provider; (3) computation provider's program and output; (1) the Airavat framework is trusted.

68 Threat Model Airavat runs the computation and still protects the privacy of the data providers. Threat: the computation provider's program. (2) Data provider; (1) the Airavat framework is trusted.

69 Roadmap What is the programming model? How do we enforce privacy? What computations can be supported in Airavat?

70 Programming Model Split MapReduce into an untrusted mapper plus a trusted reducer, with a limited set of stock reducers. In a MapReduce program for data mining, the untrusted mapper needs no audit; the reducer is trusted and provided by Airavat.

71 Programming Model Need to confine the mappers! Guarantee: protect the privacy of the data providers. The untrusted mapper needs no audit; the reducer is trusted and provided by Airavat.

72 Challenge 1: Untrusted Mapper Untrusted mapper code copies data and sends it over the network. (Data from Peter, Chris and Meg flows through map and reduce.) Leaks using system resources.

73 Challenge 2: Untrusted Mapper The output of the computation is also an information channel. Example: output 1 million if Peter bought Vi*gra. (Data from Peter, Chris and Meg flows through map and reduce to the output.)

74 Airavat Mechanisms Mandatory access control: prevents leaks through storage channels like network connections or files. Differential privacy: prevents leaks through the output of the computation. Data flows through map and reduce to the output.

75 Back to the Roadmap What is the programming model? Untrusted mapper + trusted reducer. How do we enforce privacy? Leaks through system resources; leaks through the output. What computations can be supported in Airavat?

76 Airavat Confines the Untrusted Code Untrusted program: given by the computation provider. MapReduce + DFS: add mandatory access control (MAC). SELinux: add a MAC policy.

77 Airavat Confines the Untrusted Code We add mandatory access control to the MapReduce framework: label the input, intermediate values, and output; malicious code cannot leak labeled data. (Data 1..3 carry access control labels through MapReduce to the output.)

78 Airavat Confines the Untrusted Code SELinux policy to enforce MAC: creates trusted and untrusted domains; processes and files are labeled to restrict interaction; mappers reside in the untrusted domain and are denied network access, with limited file-system interaction.

79 But Access Control Is Not Enough Labels can prevent the output from being read, but when can we remove the labels? Example: if (input belongs-to Peter) print(iPad, ...). The output leaks the presence of Peter! (Inputs iPad, Tablet PC, iPad, Laptop pass through the map phase and SUM to the labeled output (iPad, ...).)

80 But Access Control Is Not Enough We need mechanisms to enforce that the output does not violate an individual's privacy.

81 Background: Differential Privacy A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not. Cynthia Dwork, Differential Privacy, ICALP.

82 Differential Privacy (Intuition) A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not. (Inputs A, B, C pass through F(x) to an output distribution.)

83 Differential Privacy (Intuition) A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not. (Inputs A, B, C and inputs A, B, C, D pass through F(x) to similar output distributions.) Bounded risk for D if she includes her data!

84 Achieving Differential Privacy A simple differentially private mechanism: given inputs x1..xn and the query "tell me f(x)", answer f(x) + noise. How much noise should one add?

85 Achieving Differential Privacy Function sensitivity (intuition): the maximum effect of any single input on the output. Aim: conceal this effect to preserve privacy. Example: computing the average height of the people in this room has low sensitivity, since any single person's height does not affect the final average by too much; calculating the maximum height has high sensitivity.

86 Achieving Differential Privacy Function sensitivity (intuition): the maximum effect of any single input on the output. Aim: conceal this effect to preserve privacy. Example: SUM over input elements drawn from [0, M] (X1 + X2 + X3 + X4): sensitivity = M, since the maximum effect of any single input element is M.

87 Achieving Differential Privacy A simple differentially private mechanism: given inputs x1..xn and the query "tell me f(x)", answer f(x) + Lap(Δ(f)). Intuition: the noise needed to mask the effect of a single input. Δ(f) = sensitivity; Lap = Laplace distribution.
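The Laplace mechanism above can be sketched with only the standard library (the random module has no Laplace sampler, so the sketch samples one via the inverse CDF). This is an illustrative sketch, not the Airavat implementation: the function names, the epsilon value and the fixed seed are all assumptions made for the example. It uses the SUM-over-[0, M] case from the slide, whose sensitivity is M.

```python
import math
import random

def laplace_noise(scale, rng=random.Random(42)):
    # Inverse-CDF sample of Laplace(0, scale); seeded only to make
    # this illustration reproducible.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_sum(values, upper_bound, epsilon):
    # SUM over inputs in [0, M] has sensitivity M,
    # so add Lap(M / epsilon) noise to the true sum.
    return sum(values) + laplace_noise(upper_bound / epsilon)

values = [0.2, 0.7, 0.4, 0.9]                 # inputs drawn from [0, 1]
noisy = private_sum(values, upper_bound=1.0, epsilon=0.5)
print(noisy)  # true sum 2.2 plus Laplace noise
```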

88 Back to the Roadmap What is the programming model? Untrusted mapper + trusted reducer. How do we enforce privacy? Leaks through system resources: MAC. Leaks through the output. What computations can be supported in Airavat?

89 Enforcing Differential Privacy The mapper can be any piece of Java code (a "black box"), but the range of mapper outputs must be declared in advance. The range is used to estimate sensitivity (how much does a single input influence the output?) and determines how much noise is added to the outputs to ensure differential privacy. Example: for a mapper range [0, M], SUM has an estimated sensitivity of M.

90 Enforcing Differential Privacy Malicious mappers may output values outside the declared range. If a mapper produces such a value, it is replaced by a value inside the range; the user is not notified, since notification would itself be a possible information leak. A range enforcer sits between the mappers and the reducer, ensuring that the code is not more sensitive than declared; noise is then added at the reducer.
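The range-enforcer idea can be sketched as a clamp over the mapper's key-value stream. This is an illustrative sketch only (function name, declared range and sample outputs are assumptions): values outside the declared range are silently replaced by in-range values, so a malicious output cannot inflate the effective sensitivity, and the mapper is not told.

```python
def enforce_range(mapper_outputs, lo, hi):
    # Silently clamp each mapper value into the declared range [lo, hi];
    # the mapper is NOT notified (that would itself leak a bit).
    for key, value in mapper_outputs:
        yield key, min(max(value, lo), hi)

declared = (0, 1)  # declared mapper range, as in the [0, M] example with M = 1
outs = [("iPad", 1), ("iPad", 1), ("iPad", 1_000_000)]  # last value malicious
clamped = list(enforce_range(outs, *declared))
print(sum(v for _, v in clamped))  # 3, not 1000002
```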

91 What Can We Compute? Reducers are responsible for enforcing privacy: they add an appropriate amount of random noise to the outputs, so reducers must be trusted. Sample reducers: SUM, COUNT, THRESHOLD; sufficient to perform data-mining algorithms, search-log processing, recommender systems, etc. With trusted mappers, more general computations are possible, using exact sensitivity instead of range-based estimates.

92 Sample Computations Many queries can be done with untrusted mappers. SUM: how many iPads were sold today? Mean: what is the average score of male students at UT? Threshold: output the frequency of security books that sold more than 25 copies today. Others require trusted mapper code, e.g. "list all items and their quantity sold": a malicious mapper can encode information in item names.

93 Revisiting Airavat Guarantees Allows differentially private MapReduce computations even when the code is untrusted. Differential privacy => a mathematical bound on the information leak. What is a safe bound on the information leak? It depends on the context and the dataset; not our problem.

94 Implementation Details SELinux policy: domains for trusted and untrusted programs; apply restrictions on each domain (500 LoC). MapReduce: modifications to support mandatory access control; set of trusted reducers (450 LoC). JVM: modifications to enforce mapper independence (5000 LoC). LoC = lines of code.

95 Airavat in Brief Airavat is a framework for privacy-preserving MapReduce computations that confines untrusted code: the first to integrate mandatory access control with differential privacy for end-to-end enforcement. Protected data plus an untrusted program, run inside Airavat.

96 References
Balaji Palanisamy, Aameek Singh, Ling Liu and Bryan Langston, "Cura: A Cost-optimized Model for MapReduce in a Cloud", in IPDPS '13.
Balaji Palanisamy, Aameek Singh, Ling Liu and Bhushan Jain, "Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud", in Proc. of IEEE/ACM Supercomputing (SC '11).
Balaji Palanisamy, Aameek Singh, Nagapramod Mandagere, Gabriel Alatorre and Ling Liu, "VNCache: Bridging Compute and Storage for Big Data in a Cloud", in CCGrid.
Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov and Emmett Witchel, "Airavat: Security and Privacy for MapReduce", in NSDI 2010.
Balaji Palanisamy, Aameek Singh, Ling Liu and Bryan Langston, "Cura: A Cost-optimized Model for MapReduce in a Cloud", IEEE Transactions on Parallel and Distributed Systems (TPDS 2014).


More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Distributed Computations MapReduce. adapted from Jeff Dean s slides Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece

More information

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create

More information

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017 Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google

More information

CSE 344 MAY 2 ND MAP/REDUCE

CSE 344 MAY 2 ND MAP/REDUCE CSE 344 MAY 2 ND MAP/REDUCE ADMINISTRIVIA HW5 Due Tonight Practice midterm Section tomorrow Exam review PERFORMANCE METRICS FOR PARALLEL DBMSS Nodes = processors, computers Speedup: More nodes, same data

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

MATE-EC2: A Middleware for Processing Data with Amazon Web Services MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Dynamic Cluster Configuration Algorithm in MapReduce Cloud

Dynamic Cluster Configuration Algorithm in MapReduce Cloud Dynamic Cluster Configuration Algorithm in MapReduce Cloud Rahul Prasad Kanu, Shabeera T P, S D Madhu Kumar Computer Science and Engineering Department, National Institute of Technology Calicut Calicut,

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014 CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

MapReduce. U of Toronto, 2014

MapReduce. U of Toronto, 2014 MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins MapReduce 1 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins 2 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce

More information

BigData and Map Reduce VITMAC03

BigData and Map Reduce VITMAC03 BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to

More information

MapReduce. Stony Brook University CSE545, Fall 2016

MapReduce. Stony Brook University CSE545, Fall 2016 MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining

More information

2013 AWS Worldwide Public Sector Summit Washington, D.C.

2013 AWS Worldwide Public Sector Summit Washington, D.C. 2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Distributed Systems CS6421

Distributed Systems CS6421 Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool

More information

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:

More information

MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat OSDI 2004 Presented by Zachary Bischof Winter '10 EECS 345 Distributed Systems 1 Motivation Summary Example Implementation

More information

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing : Exploring Memory Locality for Big Data Analytics in Virtualized Clusters Eunji Hwang, Hyungoo Kim, Beomseok Nam and Young-ri

More information

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014 Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

MapReduce-style data processing

MapReduce-style data processing MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic

More information

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands Unleash Your Data Center s Hidden Power September 16, 2014 Molly Rector CMO, EVP Product Management & WW Marketing

More information

Batch Processing Basic architecture

Batch Processing Basic architecture Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3

More information

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BRETT WENINGER, MANAGING DIRECTOR 10/21/2014 ADURANT APPROACH TO BIG DATA Align to Un/Semi-structured Data Instead of Big Scale out will become Big Greatest

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

Introduction to Map Reduce

Introduction to Map Reduce Introduction to Map Reduce 1 Map Reduce: Motivation We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Distributed Systems 16. Distributed File Systems II

Distributed Systems 16. Distributed File Systems II Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS

More information

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Google File System (GFS) and Hadoop Distributed File System (HDFS) Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

MapReduce and Hadoop

MapReduce and Hadoop Università degli Studi di Roma Tor Vergata MapReduce and Hadoop Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference Big Data stack High-level Interfaces Data Processing

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Practice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce

Practice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce Practice and Applications of Data Management CMPSCI 345 Lecture 18: Big Data, Hadoop, and MapReduce Why Big Data, Hadoop, M-R? } What is the connec,on with the things we learned? } What about SQL? } What

More information

Part A: MapReduce. Introduction Model Implementation issues

Part A: MapReduce. Introduction Model Implementation issues Part A: Massive Parallelism li with MapReduce Introduction Model Implementation issues Acknowledgements Map-Reduce The material is largely based on material from the Stanford cources CS246, CS345A and

More information

Introduction to MapReduce (cont.)

Introduction to MapReduce (cont.) Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations

More information

Enhanced Hadoop with Search and MapReduce Concurrency Optimization

Enhanced Hadoop with Search and MapReduce Concurrency Optimization Volume 114 No. 12 2017, 323-331 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Enhanced Hadoop with Search and MapReduce Concurrency Optimization

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Hedvig as backup target for Veeam

Hedvig as backup target for Veeam Hedvig as backup target for Veeam Solution Whitepaper Version 1.0 April 2018 Table of contents Executive overview... 3 Introduction... 3 Solution components... 4 Hedvig... 4 Hedvig Virtual Disk (vdisk)...

More information

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core

More information

Hadoop and HDFS Overview. Madhu Ankam

Hadoop and HDFS Overview. Madhu Ankam Hadoop and HDFS Overview Madhu Ankam Why Hadoop We are gathering more data than ever Examples of data : Server logs Web logs Financial transactions Analytics Emails and text messages Social media like

More information

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Journal homepage: www.mjret.in ISSN:2348-6953 A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Bhavsar Nikhil, Bhavsar Riddhikesh,Patil Balu,Tad Mukesh Department of Computer Engineering JSPM s

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

EsgynDB Enterprise 2.0 Platform Reference Architecture

EsgynDB Enterprise 2.0 Platform Reference Architecture EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

CS60021: Scalable Data Mining. Sourangshu Bhattacharya

CS60021: Scalable Data Mining. Sourangshu Bhattacharya CS60021: Scalable Data Mining Sourangshu Bhattacharya In this Lecture: Outline: HDFS Motivation HDFS User commands HDFS System architecture HDFS Implementation details Sourangshu Bhattacharya Computer

More information

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight ESG Lab Review InterSystems Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Hadoop Distributed File System(HDFS)

Hadoop Distributed File System(HDFS) Hadoop Distributed File System(HDFS) Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul

More information

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016 15-319 / 15-619 Cloud Computing Recitation 3 Sep 13 & 15, 2016 1 Overview Administrative Issues Last Week s Reflection Project 1.1, OLI Unit 1, Quiz 1 This Week s Schedule Project1.2, OLI Unit 2, Module

More information

The Google File System. Alexandru Costan

The Google File System. Alexandru Costan 1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems

More information

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568 FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

IntegrityMR: Integrity Assurance Framework for Big Data Analytics and Management

IntegrityMR: Integrity Assurance Framework for Big Data Analytics and Management IntegrityMR: Integrity Assurance Framework for Big Data Analytics and Management Applications Yongzhi Wang, Jinpeng Wei Florida International University Mudhakar Srivatsa IBM T.J. Watson Research Center

More information