Cost-effective and Privacy-conscious MapReduce Cloud Services. Guest lecture by Dr. Balaji Palanisamy
1 Cost-effective and Privacy-conscious MapReduce Cloud Services. Guest lecture by Dr. Balaji Palanisamy
2 Computing Paradigm Shift. Cloud challenges: cost-effectiveness, performance and security
3 Cloud Computing Challenges. Q: How important is it that cloud service providers do the following (scale: 1-5; 1 = not at all important, 5 = very important)? Percent responding 3, 4 or 5:
- Offer competitive pricing: 92%
- Offer Service Level Agreements: 89%
- Provide a complete solution: 86%
- Understand my business and industry: 85%
- Allow managing on-premise & off-premise...: 82%
- Support many of my IT needs: 81%
- Offer both private & public clouds: 79%
- Are a technology & business model...: 78%
- Have local presence, can come to my...: 73%
Source: IDC Enterprise Panel, 3Q2009, n = 263
4 Cloud Computing Challenges. Q: Rate the challenges/issues of the 'cloud'/on-demand model (1 = not significant, 5 = very significant). Percent responding 3, 4 or 5:
- Security: 88.5%
- Performance: 88.1%
- Availability: 84.8%
- Hard to integrate with in-house IT: 84.5%
- Not enough ability to customize: 83.3%
- Worried cloud will cost more: 81.1%
- Bringing back in-house may be difficult: 80.3%
- Not enough major suppliers yet: 74.6%
Source: IDC Enterprise Panel
5 Data Growth. Worldwide corporate data growth (in exabytes): about 80% of the data growth is unstructured data. [Chart: unstructured vs. structured data.] Source: IDC, The Digital Universe 2010
6 Big Data Processing in a Cloud. MapReduce and Big Data processing: a programming model for data-intensive computing on commodity clusters, pioneered by Google, which uses it to process 20 PB of data per day. Scalability to large data volumes: scanning 100 TB on 1 node at 50 MB/s takes about 24 days; scanning on a 1000-node cluster takes about 35 minutes. It is estimated that, by 2015, more than half the world's data will be processed by Hadoop (Hortonworks). Existing dedicated MapReduce clouds are natural extensions of rental store models, with performance and cost inefficiency.
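The scan-time figures quoted on the slide can be checked with quick arithmetic (decimal units and the 50 MB/s per-node rate are taken from the slide; the small differences from the slide's 24 days / 35 minutes are just rounding):

```python
# Back-of-the-envelope check of the scan times quoted on the slide.
MB_PER_TB = 10**6            # decimal units assumed
data_mb = 100 * MB_PER_TB    # 100 TB of input
rate_mb_s = 50               # per-node scan rate

single_node_days = data_mb / rate_mb_s / 86_400       # one node
cluster_minutes = data_mb / (rate_mb_s * 1000) / 60   # 1000 nodes

print(round(single_node_days, 1))  # 23.1 (slide rounds to ~24 days)
print(round(cluster_minutes, 1))   # 33.3 (slide says ~35 minutes)
```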
7 Outline. Goals: cost-effective and scalable Big Data processing using MapReduce; privacy-conscious data access and data processing in a cloud. Outline: (1) Cost-aware resource management. Goal: devise resource management techniques that yield the required performance at the lowest cost. Cura: a cost-optimized model for MapReduce in a cloud. (2) Performance-driven resource optimization. Goal: optimizations based on physical resource limits in a data center, such as networking and storage bottlenecks. Purlieus: locality-aware resource allocation for MapReduce. (3) Privacy-conscious MapReduce systems. VNCache: MapReduce analysis for cloud-archived data. Airavat: security and privacy for MapReduce.
8 MapReduce - Data Parallelism. x := (a * b) + (y * z); computation A, computation B. At the micro level, independent algebraic operations commute: they can be processed in any order. If commutative operations are applied to different memory addresses, then they can also occur at the same time. Compilers and CPUs often do so automatically.
9 Higher-level Parallelism. x := foo(a) + bar(b); computation A, computation B. Commutativity can apply to larger operations: if foo() and bar() do not manipulate the same memory, then there is no reason why they cannot occur at the same time.
10 Parallelism: Dependency Graphs. x := foo(a) + bar(b) yields the graph foo(a), bar(b), write x, where arrows indicate dependent operations. If foo and bar do not access the same memory, there is no dependency between them, and these operations can occur in parallel in different threads. The write x operation waits for its predecessors to complete.
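The dependency graph above can be exercised directly: a minimal sketch in which foo and bar (placeholder functions, not from the slides) run concurrently because they share no memory, while the final addition waits on both:

```python
from concurrent.futures import ThreadPoolExecutor

def foo(a):
    # stand-in for an expensive, side-effect-free computation
    return a * a

def bar(b):
    return b + 10

# foo(a) and bar(b) touch no shared memory, so the dependency graph
# lets them run concurrently; only the final addition must wait.
with ThreadPoolExecutor(max_workers=2) as pool:
    fa = pool.submit(foo, 3)        # computation A
    fb = pool.submit(bar, 4)        # computation B
    x = fa.result() + fb.result()   # 'write x' waits for both

print(x)  # 23
```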
11 Master/Workers. One object, called the master, initially owns all the data; it creates several workers to process individual elements and waits for the workers to report results back. [Diagram: master and worker threads.]
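A minimal sketch of the master/workers pattern with Python threads (the squaring task and all names are illustrative, not from the slides): the master owns the data, hands elements to workers through a queue, and collects results.

```python
import queue
import threading

def worker(tasks, results):
    # each worker pulls elements, processes them, and reports back
    while True:
        item = tasks.get()
        if item is None:          # sentinel: no more work
            break
        results.put(item * item)  # process one element

tasks, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for t in threads:
    t.start()

for n in range(10):               # the master initially owns all data
    tasks.put(n)
for _ in threads:                 # one sentinel per worker
    tasks.put(None)
for t in threads:
    t.join()                      # master waits for workers to finish

total = sum(results.get() for _ in range(10))
print(total)  # 285 = 0^2 + 1^2 + ... + 9^2
```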
12 Example: count the occurrences of each word in a large collection of documents.

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
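The pseudocode above can be run as a tiny single-process simulation, with plain Python standing in for the MapReduce framework (function names and the sample documents are illustrative):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # emit (word, 1) for every word in the document
    return [(w, 1) for w in contents.split()]

def reduce_fn(word, counts):
    # sum all the partial counts for one word
    return word, sum(counts)

docs = {"d1": "ipad laptop ipad", "d2": "tablet ipad"}

# shuffle: group intermediate pairs by key, as the framework would
groups = defaultdict(list)
for name, text in docs.items():
    for word, one in map_fn(name, text):
        groups[word].append(one)

result = dict(reduce_fn(w, c) for w, c in groups.items())
print(result)  # {'ipad': 3, 'laptop': 1, 'tablet': 1}
```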
13 MapReduce Execution Overview. The user program forks a master, which assigns map and reduce tasks. The input data is divided into splits (split 0, split 1, split 2); map workers read the splits and write intermediate results to local disk; reduce workers perform remote reads and sort, then write the output files (output file 0, output file 1).
14 Need for a Cost-optimized Cloud Usage Model. Existing per-job optimized models: per-job, customer-side greedy optimization may not be globally optimal, leading to higher cost for customers. Cura usage model: the user submits a job and specifies the required service quality in terms of job response time, and is charged only for that service quality; the cloud provider manages the resources to ensure each job's service requirements. Other cloud-managed resource models: the Database-as-a-Service model, e.g. Relational Cloud (CIDR 2011), a cloud-managed model for resource management with a cloud-managed SQL-like query service; the delayed-query execution model in Google BigQuery results in 40% lower cost.
15 User Scheduling vs. Cloud Scheduling. [Tables: per job, the arrival time, deadline, running time and optimal number of VMs; under user scheduling and under cloud scheduling, each job's start time and concurrent VMs.]
16 Cura System Architecture. [Architecture diagram.]
17 Static Virtual Machine Sets. The cluster of physical machines hosts a pool of small instances, a pool of large instances, and a pool of extra-large instances.
18 Static Partitioning of Virtual Machine Sets. The cluster of physical machines is statically partitioned into the pools of small, large, and extra-large instances.
19 Key Challenges in Cura Design. Resource provisioning and scheduling: optimal scheduling, optimal cluster configuration, optimal Hadoop configuration. Virtual machine management: optimal capacity planning; what is the right set of VMs (VM types) for the current workload? Minimize capital expenditure and operational expenses. Resource pricing: what is the price of each job, based on its service quality and job characteristics?
20 VM-aware Job Scheduling. Multi-bin backfilling: the scheduler needs to decide which instance type to use for each job. The job scheduler has two major goals: (i) complete all job executions within their deadlines, and (ii) minimize operating expense by minimizing resource usage.
21 VM-aware Scheduling Algorithm. Goal: the VM-aware scheduler decides (a) when to schedule each job in the job queue, (b) which VM instance pool to use, and (c) how many VMs to use for the job, making minimum reservations without under-utilizing any resources. Job J_i has higher priority than job J_j if the cost of scheduling J_i is higher. For each VM pool, the scheduler picks the highest-priority job, J_prior, in the job queue and makes a reservation; subsequently, it picks the next highest-priority jobs in the queue, considering priority only with respect to the reservations that are possible within the current reservation time windows of the VM pools. The algorithm runs in O(n^2) time, and it is straightforward to obtain a distributed implementation to scale further.
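The priority rule above can be illustrated with a minimal greedy sketch. This is not Cura's actual algorithm (which also handles reservation time windows, VM counts and deadlines); it only shows the "higher scheduling cost means higher priority" ordering per pool, and every name here is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cost: float   # estimated cost of scheduling this job (illustrative)
    pool: str     # VM instance pool the job should run in

def schedule(jobs, pools):
    """Greedy pass: within each pool, reserve jobs in decreasing cost
    order, mirroring 'J_i has priority over J_j if its cost is higher'."""
    plan = {p: [] for p in pools}
    for job in sorted(jobs, key=lambda j: j.cost, reverse=True):
        plan[job.pool].append(job.name)
    return plan

jobs = [Job("J1", 5.0, "small"), Job("J2", 9.0, "small"), Job("J3", 2.0, "large")]
print(schedule(jobs, ["small", "large"]))
# {'small': ['J2', 'J1'], 'large': ['J3']}
```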
22 Reconfiguration-aware Scheduler. Assume two new jobs need a cluster of 9 small instances and 4 large instances, respectively. The scheduler has the following options: (1) wait for some other clusters of small instances to complete execution; (2) run the job in an available cluster of extra-large instances; (3) convert some large or extra-large instances into multiple small instances. [Figures: reconfiguration-unaware vs. reconfiguration-aware scheduler.]
23 Reconfiguration Algorithm. The reconfiguration time window is an observation period for collecting a history of job executions. For each job J_i, the optimal cost C_opt(J_i), which incurs the lowest resource usage, is found. The reconfiguration plan reflects the proportion of demand for the optimal instance types. [Reconfiguration benefit: equation shown on slide.]
24 Number of Servers and Effective Utilization. Fig. 1: number of servers and Fig. 2: effective utilization, for a dedicated cluster, per-job clusters, and Cura, across deadlines. Cura requires 80% fewer resources than conventional cloud models and achieves significantly higher resource utilization.
25 Talk Outline. Cost-optimized cloud resource allocation (Cura: a cost-optimized model for MapReduce in a cloud); performance-driven resource optimization (Purlieus: locality-aware resource allocation for MapReduce); privacy-conscious processing of cloud data (VNCache: efficient MapReduce analysis for cloud-archived data; Airavat: security and privacy for MapReduce).
26 MapReduce Cloud Provider: Challenges. Different loads on the shared physical infrastructure: computation load (number and size of each VM: CPU, memory); storage load (amount of input, output and intermediate data); network load (traffic generated during the map, shuffle and reduce phases). The network load is of special concern with MapReduce, since large amounts of traffic can be generated in the shuffle phase: as each reduce task needs to read the output of all map tasks, a sudden explosion of network traffic can significantly deteriorate cloud performance.
27 Existing Dedicated MapReduce Clouds. Amazon Elastic MapReduce model: utilize a hosted Hadoop framework running on a compute cloud (e.g. Amazon EC2) and use a separate storage service (e.g. Amazon S3). Drawbacks: it adversely impacts performance as it violates data locality, and unnecessary replication leads to higher costs; processing 100 TB of data using 100 VMs takes 3 hours just to load the data.
28 Purlieus: Storage Architecture. A semi-persistent storage architecture: data is broken up into chunks corresponding to MapReduce blocks and stored on a distributed file system of the physical machines. The data stored on physical machines can transition seamlessly, i.e. without requiring an explicit data-loading step: it is made available to the VMs using loopback mounts and VM disk attach.
29 Purlieus Architecture and Locality-optimization Goals. [Diagram: placement of labeled data blocks across the physical machines of the cluster.]
30 Job-specific Locality-awareness. Job-specific locality cost; job categories: map- and reduce-input heavy (sort workload: map input size = map output size); map-input heavy (grep workload: large map input, small map output); reduce-input heavy (synthetic dataset generators: small map input, large map output).
31 Map- and Reduce-input Heavy Jobs. Example: sorting a huge data set (output size = input size); both map and reduce localities are important. Map phase: incorporate map locality by pushing compute towards the data. Reduce phase: make communications happen within a set of closely connected nodes.
32 Scheduling Map- and Reduce-input Heavy Tasks (map phase). [Diagram: map task placement on the physical machines holding the input blocks.]
33 Scheduling Map- and Reduce-input Heavy Tasks (reduce phase). [Diagram: reduce task placement on a set of closely connected nodes.]
34 Experimental Setup. Cluster setup: a testbed of 20 CentOS 5.5 physical machines (KVM as the hypervisor); each physical machine has 16-core 2.53 GHz Intel processors; the network is 1 Gbps; nodes are organized in 2 racks, each containing 10 physical machines. Job types: by default, each job uses 20 VMs, each configured with 4 GB of memory and 4 2-GHz vCPUs.
35 Impact of Reduce Locality. [Timelines of running map, shuffle and reduce tasks, with only map locality vs. with both map and reduce locality, plotted using Hadoop job_history_summary.]
36 Map- and Reduce-input Heavy Workload. Fig. 3: execution time and Fig. 4: normalized cross-rack traffic, for locality-aware vs. random data placement under map-locality, reduce-locality, map-and-reduce-locality, and no-locality VM placement. Performance depends on both map and reduce locality! Locality depends not only on compute placement but also on data placement.
37 Talk Outline. Cost-optimized cloud resource allocation (Cura: a cost-optimized model for MapReduce in a cloud); performance-driven resource optimization (Purlieus: locality-aware resource allocation for MapReduce); privacy-conscious processing of cloud data (VNCache: efficient MapReduce analysis for cloud-archived data; Airavat: security and privacy for MapReduce).
38 MapReduce Analysis for Cloud-archived Data. The compute cloud and the storage cloud cannot always be collocated, e.g. when there is a privacy/security concern. Example use case: private log data archived in clouds. Logs can contain sensitive information and require the data to be encrypted before it is moved to the cloud; current solutions require the data to be transferred to, and processed in, a secure enterprise cluster.
39 Current Analytics Solutions for Cloud-archived Data. Current analytics platforms at the secure enterprise site (e.g. Hadoop/MapReduce) first download the data from these public clouds, decrypt it, and then process it at the secure enterprise site. The data needs to be loaded before the jobs can even start processing, and it is duplicated at both the enterprise site and the storage clouds.
40 Current Analytics Solutions for Cloud-archived Data (cont.). WAN copy time, decrypt time and HDFS load time are predominant factors in job execution time; the actual job execution is significantly slowed down by this overhead.
41 VNCache: Seamless Data Caching. Key idea: use a virtual HDFS filesystem image that provides virtual input data files on the target cluster, giving the Hadoop data blocks a virtual existence in HDFS and enabling jobs to begin execution immediately. To an HDFS filesystem running on top of VNCache, all data appears to be local, even though initially the data is at the storage cloud.
42 VNCache: Data Flow. Data is archived at the public storage cloud as chunks of HDFS block size; for job analysis, VNCache seamlessly streams and prefetches the data from the storage cloud.
43 Overview of HDFS. HDFS is the primary storage used by Hadoop applications. The HDFS NameNode manages the file system namespace and regulates access to files by clients; the individual DataNodes manage the storage attached to the nodes they run on.
44 Interaction of HDFS with its Underlying Filesystem. The NameNode at the master node stores the HDFS filesystem image, fsimage. The HDFS filesystem namespace includes the mapping of HDFS files to their constituent blocks, the block identifiers, and the modification and access times. A DataNode stores a set of HDFS blocks, using separate files in the underlying filesystem for both the metadata and the actual blocks.
45 HDFS Filesystem Image Layout. The image consists of an inode structure for each HDFS file; each inode holds the information for the individual data blocks corresponding to the file.
46 HDFS Virtualization. The VNCache filesystem is a FUSE-based filesystem used as the underlying filesystem on the NameNode and DataNodes. It is a virtual filesystem (similar to /proc on Linux) that simulates various files and directories to the HDFS layer placed on it. NameNode filesystem (fsimage) virtualization: a virtual HDFS namespace is created with a list of the relevant HDFS blocks. DataNode virtualization: exposes a list of data files corresponding to the HDFS blocks placed on that DataNode.
47 VNCache's Data Flow. Two modes of operation: streaming and prefetching. Seamless streaming: for data block accesses on demand, the VNCache filesystem streams the data seamlessly from the storage cloud. Cache prefetching: obtains blocks ahead of processing and significantly reduces job execution time.
48 Cache Prefetching. The cache prefetching logic analyzes the job input data and prefetches data blocks, carefully considering the order of task processing in the input data. Prefetching order: task ordering based on decreasing order of file sizes in the job input, and on the order of blocks in the HDFS filesystem image. Distributed cache prefetching algorithm: a master prefetcher makes the prefetching decisions and delegates prefetch tasks to slave prefetchers, which transfer the individual data blocks from the remote cloud.
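The prefetching order described above can be sketched in a few lines. This is only an illustration of the ordering rule (files by decreasing size, blocks in fsimage order within each file), not VNCache's implementation; the data layout and names are assumptions:

```python
# Illustrative sketch of the prefetch ordering on the slide: files in
# decreasing size order, blocks in HDFS-image order within each file.
def prefetch_order(files):
    """files: {file name: [block ids in fsimage order]};
    file size is approximated here by its block count."""
    order = []
    for name in sorted(files, key=lambda f: len(files[f]), reverse=True):
        order.extend(files[name])   # keep the within-file block order
    return order

files = {"a.log": ["b1", "b2"], "b.log": ["b3", "b4", "b5"]}
print(prefetch_order(files))  # ['b3', 'b4', 'b5', 'b1', 'b2']
```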
49 Caching Algorithm. Rate adaptiveness: the prefetcher monitors the progress of the Hadoop jobs in terms of task processing rate and HDFS block request rate, and adaptively decides the set of data blocks to prefetch. Cache eviction policy: enforces a minimal-caching principle, keeping only the most immediately required blocks in the cache and minimizing the total storage footprint. Workflow awareness: workflows may have multiple jobs processing the same input dataset; VNCache recognizes this and prefetches those blocks only once, using persistent workflow-aware caching.
50 Experimental Setup. Cluster setup: a testbed of 20 CentOS 5.5 physical machines (KVM as the hypervisor); each physical machine has 16-core 2.53 GHz Intel processors; 5 nodes are used as remote storage cloud servers. Individual job workloads: Grep, Sort, and Facebook jobs; workflow-based workloads: Facebook workflows and TF-IDF, generated using the SWIM workload generator. Dataset sizes: 5 GB, 10 GB and 15 GB; network latencies: 45 ms, 90 ms, and 150 ms. By default, each job uses 5 VMs, each configured with 4 GB of memory and 3.5-GHz vCPUs.
51 Performance of Grep Workload. Fig. 5: execution time. VNCache's streaming model achieves a 42% reduction in execution time, and the VNCache prefetch optimization improves it by a further 30%.
52 Performance of Facebook Workload. Fig. 6: execution time; Fig. 7: cache hit ratio. VNCache obtains more than a 55% reduction in job execution time and achieves an average cache hit ratio of 43.5%.
53 TF-IDF Workflow. Fig. 9: execution time; Fig. 10: cache hit ratio. VNCache reduces the execution time by 47% and achieves 70% cache hits on average.
54 Impact of Cache Size (Grep Workload). Fig. 15: execution time; Fig. 16: cache hit ratio. Even with a reasonable cache size, VNCache achieves good performance.
55 Talk Outline. Cost-optimized cloud resource allocation (Cura: a cost-optimized model for MapReduce in a cloud); performance-driven resource optimization (Purlieus: locality-aware resource allocation for MapReduce); privacy-conscious processing of cloud data (VNCache: efficient MapReduce analysis for cloud-archived data; Airavat: security and privacy for MapReduce; slides adapted from Indrajit Roy et al., NSDI 2010).
56 Computing in the Year 201X. Data in the cloud: an illusion of infinite resources; pay only for the resources used; quickly scale up or scale down.
57 Programming Model in the Year 201X. Frameworks are available to ease cloud programming. MapReduce: parallel processing on clusters of machines; the data flows through map and reduce to the output. Applications: data mining, genomic computation, social networks.
58 Programming Model in the Year 201X (cont.). Thousands of users upload their data: healthcare, shopping transactions, census, click streams. Multiple third parties mine the data for better service. Example: healthcare data. Incentive to contribute: cheaper insurance policies, new drug research, inventory control in drug stores. Fear: what if someone targets my personal data? An insurance company could find my illness and increase my premium.
59 Privacy in the Year 201X? An untrusted MapReduce program runs over health data for data mining, genomic computation, or social networks: does the output leak information?
60 Use De-identification? It achieves privacy by syntactic transformations: scrubbing, k-anonymity. But it is insecure against attackers with external information; privacy fiascoes: AOL search logs, the Netflix dataset. Can we instead run untrusted code on the original data, and how do we ensure the privacy of the users?
61 This Talk: Airavat. A framework for privacy-preserving MapReduce computations with untrusted code: protected data and an untrusted program run under Airavat. Airavat is the elephant of the clouds (Indian mythology).
62 Airavat Guarantee. Bounded information leak* about any individual's data after performing a MapReduce computation. (*Differential privacy.)
63 Background: MapReduce. map(k1, v1) -> list(k2, v2); reduce(k2, list(v2)) -> list(v2). The input data flows through the map phase and the reduce phase to the output.
64 MapReduce Example: counting the number of iPads sold. Map(input) { if (input has iPad) print(iPad, 1) }; Reduce(key, list(v)) { print(key + ", " + SUM(v)) }. For the inputs iPad, Tablet PC, iPad, Laptop, the map phase emits (iPad, 1) twice, and the reduce phase SUMs them to (iPad, 2).
65 Airavat Model. The Airavat framework runs on the cloud infrastructure. Cloud infrastructure: hardware + VM; Airavat: modified MapReduce + DFS + JVM + SELinux. (1) The Airavat framework and the cloud infrastructure are trusted.
66 Airavat Model (cont.). (2) The data provider uploads her data on Airavat and sets certain privacy parameters.
67 Airavat Model (cont.). (3) The computation provider writes the data-mining algorithm, which is untrusted and possibly malicious, and receives the output.
68 Threat Model. Airavat runs the computation and still protects the privacy of the data providers; the threat is the computation provider's program.
69 Roadmap. What is the programming model? How do we enforce privacy? What computations can be supported in Airavat?
70 Programming Model. Split MapReduce into an untrusted mapper plus a trusted reducer, with a limited set of stock reducers. The mapper of the data-mining MapReduce program needs no audit; the trusted reducer runs inside Airavat.
71 Programming Model (cont.). We need to confine the mappers! Guarantee: protect the privacy of the data providers.
72 Challenge 1: Untrusted Mapper. Untrusted mapper code may copy the data (e.g. Peter's, Chris's and Meg's records) and send it over the network, leaking it through system resources.
73 Challenge 2: Untrusted Mapper. The output of the computation is also an information channel: e.g. output 1 million if Peter bought Vi*gra.
74 Airavat Mechanisms. Mandatory access control prevents leaks through storage channels such as network connections and files; differential privacy prevents leaks through the output of the computation.
75 Back to the Roadmap. What is the programming model? Untrusted mapper + trusted reducer. How do we enforce privacy, against leaks through system resources and leaks through the output? What computations can be supported in Airavat?
76 Airavat Confines the Untrusted Code. The untrusted program is given by the computation provider; mandatory access control (MAC) is added to MapReduce + DFS, and the MAC policy is added via SELinux.
77 Airavat Confines the Untrusted Code (cont.). We add mandatory access control to the MapReduce framework: the input, intermediate values and output are labeled, and malicious code cannot leak labeled data.
78 Airavat Confines the Untrusted Code (cont.). An SELinux policy enforces MAC: it creates trusted and untrusted domains, and processes and files are labeled to restrict interaction. Mappers reside in the untrusted domain: they are denied network access and allowed only limited file system interaction.
79 But Access Control Is Not Enough. Labels can prevent the output from being read, but when can we remove the labels? Example: if (input belongs-to Peter) print(iPad, ...). The output leaks the presence of Peter!
80 But Access Control Is Not Enough (cont.). We need mechanisms to enforce that the output does not violate an individual's privacy.
81 Background: Differential Privacy. A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not. (Cynthia Dwork, Differential Privacy, ICALP.)
82 Differential Privacy (Intuition). A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not. [Diagram: inputs A, B, C fed to F(x), producing an output distribution.]
83 Differential Privacy (Intuition, cont.). [Diagram: F(x) over A, B, C and F(x) over A, B, C, D produce similar output distributions.] Bounded risk for D if she includes her data!
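The intuition on these slides corresponds to the standard formal definition of differential privacy (not written out on the slides): for any two datasets D and D' differing in one record, and any set of outputs S,

```latex
\Pr[F(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[F(D') \in S]
```

so including or excluding any one individual's record changes the probability of every output by at most a factor of e^ε, which is exactly the "bounded risk" the slide claims for D.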
84 Achieving Differential Privacy. A simple differentially private mechanism: given inputs x1, ..., xn and the query "tell me f(x)", answer f(x) + noise. How much noise should one add?
85 Achieving Differential Privacy (cont.). Function sensitivity (intuition): the maximum effect of any single input on the output. Aim: conceal this effect to preserve privacy. Example: computing the average height of the people in this room has low sensitivity, since any single person's height does not affect the final average by too much; calculating the maximum height has high sensitivity.
86 Achieving Differential Privacy (cont.). Example: SUM over input elements X1, X2, X3, X4 drawn from [0, M] has sensitivity M, since the maximum effect of any input element is M.
87 Achieving Differential Privacy (cont.). The mechanism answers the query "tell me f(x)" with f(x) + Lap(Δ(f)), where Lap is the Laplace distribution. Intuition: this is the noise needed to mask the effect of a single input, where Δ(f) is the sensitivity.
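The Laplace mechanism above can be sketched directly for the SUM-over-[0, M] example from slide 86. The epsilon parameter, the sample data and all names are illustrative assumptions (the slides do not fix a privacy budget), and the Laplace sampler uses standard inverse-CDF sampling:

```python
import math
import random

def laplace(scale):
    # inverse-CDF sample from the Laplace(0, scale) distribution
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_sum(values, m, epsilon):
    """SUM over inputs drawn from [0, m]: sensitivity Delta(f) = m,
    so Laplace noise with scale m / epsilon masks any single input."""
    return sum(values) + laplace(m / epsilon)

random.seed(0)                    # for reproducibility only
heights = [0.3, 0.7, 0.5, 0.9]   # inputs in [0, 1] => sensitivity 1
print(private_sum(heights, m=1.0, epsilon=0.5))  # true sum 2.4, plus noise
```

Averaged over many runs, the noisy answers center on the true sum, which is what makes the mechanism useful despite the per-query noise.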
88 Back to the Roadmap. What is the programming model? Untrusted mapper + trusted reducer. How do we enforce privacy? Leaks through system resources: MAC; leaks through the output. What computations can be supported in Airavat?
89 Enforcing Differential Privacy. The mapper can be any piece of Java code (a "black box"), but the range of mapper outputs must be declared in advance. The range is used to estimate the sensitivity (how much does a single input influence the output?), which determines how much noise is added to the outputs to ensure differential privacy. Example: for a mapper range [0, M], SUM has an estimated sensitivity of M.
90 Enforcing Differential Privacy (cont.). Malicious mappers may output values outside the declared range. If a mapper produces a value outside the range, it is replaced by a value inside the range; the user is not notified, since notification would itself be a possible information leak. A range enforcer sits between the mappers and the reducer and ensures that the code is not more sensitive than declared; noise is then added at the reducer.
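The range enforcer above can be illustrated with a minimal sketch. The slides only require that an out-of-range value be silently replaced by some in-range value; clamping is one plausible choice, and all names here are assumptions rather than Airavat's actual code:

```python
# Hedged sketch of the range-enforcement idea: mapper outputs outside
# the declared range [lo, hi] are silently replaced by an in-range
# value (clamping is an assumed policy; the user is not notified).
def enforce_range(value, lo, hi):
    return min(max(value, lo), hi)

declared = (0, 25)            # mapper-declared output range
outputs = [3, 40, -7, 25]     # a malicious mapper may exceed it
safe = [enforce_range(v, *declared) for v in outputs]
print(safe)  # [3, 25, 0, 25]
```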
91 What Can We Compute? Reducers are responsible for enforcing privacy: they add an appropriate amount of random noise to the outputs, so the reducers must be trusted. Sample reducers: SUM, COUNT, THRESHOLD, sufficient to perform data-mining algorithms, search log processing, recommender systems, etc. With trusted mappers, more general computations are possible, using the exact sensitivity instead of range-based estimates.
92 Sample Computations. Many queries can be done with untrusted mappers. SUM: how many iPads were sold today? MEAN: what is the average score of male students at UT? THRESHOLD: output the frequency of security books that sold more than 25 copies today. Others require trusted mapper code, e.g. listing all items and their quantities sold, since a malicious mapper could encode information in the item names.
93 Revisiting Airavat Guarantees. Airavat allows differentially private MapReduce computations even when the code is untrusted; differential privacy gives a mathematical bound on the information leak. What is a safe bound on the information leak? That depends on the context and the dataset: not our problem.
94 Implementation Details. SELinux policy: domains for trusted and untrusted programs, with restrictions applied to each domain (500 LoC). MapReduce: modifications to support mandatory access control, plus a set of trusted reducers (450 LoC). JVM: modifications to enforce mapper independence (5000 LoC). (LoC = lines of code.)
95 Airavat in Brief. Airavat is a framework for privacy-preserving MapReduce computations that confines untrusted code; it is the first to integrate mandatory access control with differential privacy for end-to-end enforcement.
96 References
- Balaji Palanisamy, Aameek Singh, Ling Liu and Bryan Langston, "Cura: A Cost-optimized Model for MapReduce in a Cloud", in IPDPS 2013.
- Balaji Palanisamy, Aameek Singh, Ling Liu and Bhushan Jain, "Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud", in Proc. of IEEE/ACM Supercomputing (SC'11).
- Balaji Palanisamy, Aameek Singh, Nagapramod Mandagere, Gabriel Alatorre and Ling Liu, "VNCache: Bridging Compute and Storage for Big Data in a Cloud", in CCGrid.
- Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov and Emmett Witchel, "Airavat: Security and Privacy for MapReduce", in NSDI 2010.
- Balaji Palanisamy, Aameek Singh, Ling Liu and Bryan Langston, "Cura: A Cost-optimized Model for MapReduce", in IEEE Transactions on Parallel and Distributed Systems (TPDS 2014).
More informationIntroduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)
Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction
More informationCS 345A Data Mining. MapReduce
CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes
More informationCS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece
More informationAnnouncements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414
Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s
More informationCS 61C: Great Ideas in Computer Architecture. MapReduce
CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing
More informationDistributed Filesystem
Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the
More informationTITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
More informationDatabase Systems CSE 414
Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create
More informationAnnouncements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm
Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before
More informationIntroduction to Data Management CSE 344
Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016
More informationBig Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.
Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop
More informationHadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017
Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google
More informationCSE 344 MAY 2 ND MAP/REDUCE
CSE 344 MAY 2 ND MAP/REDUCE ADMINISTRIVIA HW5 Due Tonight Practice midterm Section tomorrow Exam review PERFORMANCE METRICS FOR PARALLEL DBMSS Nodes = processors, computers Speedup: More nodes, same data
More informationCloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe
Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability
More informationIntroduction to Data Management CSE 344
Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later
More informationBig Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing
Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela
More informationA BigData Tour HDFS, Ceph and MapReduce
A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!
More informationProgramming Systems for Big Data
Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There
More informationMATE-EC2: A Middleware for Processing Data with Amazon Web Services
MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering
More informationCloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018
Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster
More informationMapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia
MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,
More informationChapter 5. The MapReduce Programming Model and Implementation
Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing
More informationDynamic Cluster Configuration Algorithm in MapReduce Cloud
Dynamic Cluster Configuration Algorithm in MapReduce Cloud Rahul Prasad Kanu, Shabeera T P, S D Madhu Kumar Computer Science and Engineering Department, National Institute of Technology Calicut Calicut,
More informationCPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University
CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationCS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014
CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions
More informationHDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung
HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per
More informationMapReduce. U of Toronto, 2014
MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model
More informationAccelerate Big Data Insights
Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not
More informationOutline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins
MapReduce 1 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins 2 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce
More informationBigData and Map Reduce VITMAC03
BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to
More informationMapReduce. Stony Brook University CSE545, Fall 2016
MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining
More information2013 AWS Worldwide Public Sector Summit Washington, D.C.
2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationDistributed Systems CS6421
Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool
More informationInforma)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies
Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:
More informationMapReduce: Simplified Data Processing on Large Clusters
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat OSDI 2004 Presented by Zachary Bischof Winter '10 EECS 345 Distributed Systems 1 Motivation Summary Example Implementation
More informationCAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters
2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing : Exploring Memory Locality for Big Data Analytics in Virtualized Clusters Eunji Hwang, Hyungoo Kim, Beomseok Nam and Young-ri
More informationParallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014
Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example
More informationDept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,
More informationMapReduce-style data processing
MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic
More informationLeveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands
Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands Unleash Your Data Center s Hidden Power September 16, 2014 Molly Rector CMO, EVP Product Management & WW Marketing
More informationBatch Processing Basic architecture
Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3
More informationBIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE
BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BRETT WENINGER, MANAGING DIRECTOR 10/21/2014 ADURANT APPROACH TO BIG DATA Align to Un/Semi-structured Data Instead of Big Scale out will become Big Greatest
More informationImproving the MapReduce Big Data Processing Framework
Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM
More informationIntroduction to Map Reduce
Introduction to Map Reduce 1 Map Reduce: Motivation We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate
More informationDatabase Applications (15-415)
Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April
More informationCloud Computing CS
Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part
More informationA brief history on Hadoop
Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationDistributed Systems 16. Distributed File Systems II
Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS
More informationGoogle File System (GFS) and Hadoop Distributed File System (HDFS)
Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear
More informationEmbedded Technosolutions
Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication
More informationECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing
ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters
More informationData Analysis Using MapReduce in Hadoop Environment
Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti
More informationEXTRACT DATA IN LARGE DATABASE WITH HADOOP
International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0
More informationMapReduce and Hadoop
Università degli Studi di Roma Tor Vergata MapReduce and Hadoop Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference Big Data stack High-level Interfaces Data Processing
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationHadoop An Overview. - Socrates CCDH
Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected
More informationPractice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce
Practice and Applications of Data Management CMPSCI 345 Lecture 18: Big Data, Hadoop, and MapReduce Why Big Data, Hadoop, M-R? } What is the connec,on with the things we learned? } What about SQL? } What
More informationPart A: MapReduce. Introduction Model Implementation issues
Part A: Massive Parallelism li with MapReduce Introduction Model Implementation issues Acknowledgements Map-Reduce The material is largely based on material from the Stanford cources CS246, CS345A and
More informationIntroduction to MapReduce (cont.)
Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations
More informationEnhanced Hadoop with Search and MapReduce Concurrency Optimization
Volume 114 No. 12 2017, 323-331 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Enhanced Hadoop with Search and MapReduce Concurrency Optimization
More informationThe amount of data increases every day Some numbers ( 2012):
1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect
More informationCSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark
CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel
More information2/26/2017. The amount of data increases every day Some numbers ( 2012):
The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to
More informationHedvig as backup target for Veeam
Hedvig as backup target for Veeam Solution Whitepaper Version 1.0 April 2018 Table of contents Executive overview... 3 Introduction... 3 Solution components... 4 Hedvig... 4 Hedvig Virtual Disk (vdisk)...
More informationKey aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling
Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment
More informationCloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationImproving Hadoop MapReduce Performance on Supercomputers with JVM Reuse
Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core
More informationHadoop and HDFS Overview. Madhu Ankam
Hadoop and HDFS Overview Madhu Ankam Why Hadoop We are gathering more data than ever Examples of data : Server logs Web logs Financial transactions Analytics Emails and text messages Social media like
More informationA SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING
Journal homepage: www.mjret.in ISSN:2348-6953 A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Bhavsar Nikhil, Bhavsar Riddhikesh,Patil Balu,Tad Mukesh Department of Computer Engineering JSPM s
More informationMAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti
International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department
More informationEsgynDB Enterprise 2.0 Platform Reference Architecture
EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationCS60021: Scalable Data Mining. Sourangshu Bhattacharya
CS60021: Scalable Data Mining Sourangshu Bhattacharya In this Lecture: Outline: HDFS Motivation HDFS User commands HDFS System architecture HDFS Implementation details Sourangshu Bhattacharya Computer
More informationAbstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight
ESG Lab Review InterSystems Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group
More informationFusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic
WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive
More informationHadoop Distributed File System(HDFS)
Hadoop Distributed File System(HDFS) Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul
More information/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016
15-319 / 15-619 Cloud Computing Recitation 3 Sep 13 & 15, 2016 1 Overview Administrative Issues Last Week s Reflection Project 1.1, OLI Unit 1, Quiz 1 This Week s Schedule Project1.2, OLI Unit 2, Module
More informationThe Google File System. Alexandru Costan
1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems
More informationFLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568
FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected
More informationMap-Reduce. Marco Mura 2010 March, 31th
Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of
More informationIntegrityMR: Integrity Assurance Framework for Big Data Analytics and Management
IntegrityMR: Integrity Assurance Framework for Big Data Analytics and Management Applications Yongzhi Wang, Jinpeng Wei Florida International University Mudhakar Srivatsa IBM T.J. Watson Research Center
More information