"Big Data" Open Source Systems. CS347: Map-Reduce & Pig. Motivation for Map-Reduce. Building Text Index - Part II. Building Text Index - Part I
Hector Garcia-Molina, Stanford University

"Big Data" Open Source Systems
- Infrastructure for distributed data computations: Map-Reduce, S4, Hyracks, Pregel [Storm, Mupet]
- Components: MemCachedD, ZooKeeper, Kestrel
- Data services: Pig, F1, Cassandra, H-Base, Big Table [Hive]

Motivation for Map-Reduce
Recall one of our sort strategies:
[Figure: R1, R2, R3 are partitioned on keys k0, k1; each partition gets a local sort to produce the result. Annotated stages: "process data & partition", then "additional processing".]
Another example: asymmetric fragment + replicate join
[Figure: Ra, Rb are partitioned by f into R1, R2, R3; S (Sa, Sb) takes part in a local join with each fragment; a union produces the result. Annotated stages: "process data & partition", then "additional processing".]
(CS 347 Notes 03)

Building Text Index - Part I
Original Map-Reduce application...
[Figure: page stream (pages 1, 2, 3 containing words such as "rat" and "cat") -> Loading -> Tokenizing -> Sorting -> Flushing -> intermediate runs on disk.]

Building Text Index - Part II
[Figure: the intermediate runs are merged into the final index.]
Intermediate runs: (ant, 5) (cat, 4) (, 4) (, 5) (eel, 6)
Final index: (ant: 2) (cat: 2,4) (: 1,2,3,4,5) (eel: 6) (rat: 1, 3)
Generalizing: Map-Reduce
[Figure: the text-index pipeline again. Loading, Tokenizing, Sorting and Flushing of the page stream form the Map phase; merging the intermediate runs into the final index forms the Reduce phase.]

Map Reduce
- Input: $R = \{r_1, r_2, \ldots, r_n\}$, functions $M$, $R$
- $M(r_i) \rightarrow \{[k_1, v_1], [k_2, v_2], \ldots\}$
- $R(k_i, \mathit{valset}) \rightarrow [k_i, \mathit{valset}]$
- Let $S = \{[k, v] \mid [k, v] \in M(r) \text{ for some } r \in R\}$ (S is a bag)
- Let $K = \{k \mid [k, v] \in S, \text{ for any } v\}$
- Let $G(k) = \{v \mid [k, v] \in S\}$ (G is a bag)
- Output $= \{[k, T] \mid k \in K,\ T = R(k, G(k))\}$

References
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
- Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins

Example: Counting Word Occurrences
    map(String doc, String value);
      // doc is document name
      // value is document content
      for each word w in value:
        EmitIntermediate(w, "1");
Why does map have 2 parameters?
Example: map(doc, cat cat bat) emits [cat 1], [ 1], [cat 1], [bat 1], [ 1]

    reduce(String key, Iterator values);
      // key is a word
      // values is a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
Example: reduce(, ) emits 4
Should it emit (, 4)??
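The formal definition above can be read as a short sequential program. The following is a minimal sketch in Python, not the Google or Hadoop implementation: the names `map_reduce`, `word_count_map`, `word_count_reduce` and the two sample documents are assumptions made for illustration. It builds the bag G(k) for every key and then applies the reduce function once per key, as in Output = {[k, T] | k in K, T = R(k, G(k))}.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Sequential reference model of Map-Reduce (illustrative only)."""
    groups = defaultdict(list)            # G(k): bag of values for each key k
    for r in records:
        for k, v in map_fn(r):            # every [k, v] emitted by M(r) belongs to S
            groups[k].append(v)
    # Output = {[k, T] | k in K, T = R(k, G(k))}, listed in key order
    return {k: reduce_fn(k, vals) for k, vals in sorted(groups.items())}


def word_count_map(record):
    doc, value = record                   # doc is the document name (unused here)
    return [(w, "1") for w in value.split()]


def word_count_reduce(key, values):
    return str(sum(int(v) for v in values))


# Hypothetical input documents
docs = [("d1", "cat bat cat"), ("d2", "bat eel")]
counts = map_reduce(docs, word_count_map, word_count_reduce)
print(counts)
```

Returning the result keyed by word is one way to answer the slide's question about whether reduce should emit the key together with the count.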
Google MR Overview
[Figure: overview of the Google MR implementation.]

Implementation Issues
- Combine function
- File system
- Partition of input, keys
- Failures
- Backup tasks
- Ordering of results

Combine Function
- Map output on one worker: [cat 1], [cat 1], [cat 1]... [ 1], [ 1]...
- Combine is like a local reduce applied before distribution: [cat 3]... [ 2]

Distributed File System
- All data transfers are through the distributed file system
- A map worker must be able to access any part of the input file
- A reduce worker must be able to access local disks on map workers
- Any worker must be able to write its part of the answer; the answer is left as a distributed file

Partition of input, keys
- How many workers, partitions of input file? How many splits? How many workers?
- Best to have many splits per worker: improves load balance; if a worker fails, it is easier to spread its tasks
- Should workers be assigned to splits near them?
- Similar questions for reduce workers

Failures
- The distributed implementation should produce the same output as would have been produced by a nonfaulty sequential execution of the program.
- General strategy: the master detects failures and has the work re-done by another worker.
[Figure: the master checks "split j ok?" with a worker and, on failure, tells another worker "redo j".]
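The combine step described above can be sketched as follows. This is an illustrative Python model, not the Google implementation; `map_words`, `combine`, `shuffle_and_reduce` and the two input splits are hypothetical. It shows that summing counts locally on each map worker shrinks the number of pairs shipped to the reduce side without changing the final counts, which holds here because addition is associative and commutative.

```python
from collections import Counter, defaultdict


def map_words(text):
    """Map: emit [word, 1] for every word in one input split."""
    return [(w, 1) for w in text.split()]


def combine(pairs):
    """Local reduce on one map worker's output: sum the counts per key."""
    local = Counter()
    for k, v in pairs:
        local[k] += v
    return sorted(local.items())


def shuffle_and_reduce(per_worker_pairs):
    """Global reduce: sum the counts per key across all map workers."""
    totals = defaultdict(int)
    for pairs in per_worker_pairs:
        for k, v in pairs:
            totals[k] += v
    return dict(totals)


splits = ["cat cat cat bat bat", "bat cat"]     # hypothetical input splits
raw = [map_words(s) for s in splits]
combined = [combine(p) for p in raw]

sent_raw = sum(len(p) for p in raw)             # pairs shipped without combine
sent_combined = sum(len(p) for p in combined)   # pairs shipped with combine
print(sent_raw, sent_combined, shuffle_and_reduce(combined))
```

A combine function is only safe when applying it locally and then reducing gives the same answer as reducing everything at once; an average, for example, would need to carry (sum, count) pairs instead.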
Backup Tasks
- A straggler is a machine that takes unusually long (e.g., because of a bad disk) to finish its work.
- A straggler can delay final completion.
- When the task is close to finishing, the master schedules backup executions for the remaining tasks.
- Must be able to eliminate redundant results.

Ordering of Results
- Final result (at each node) is in key order: [k1, T1] [k2, T2] [k3, T3] [k4, T4]
- Also in key order: [k1, v1] [k3, v3]

Example: Sorting Records
Map: extract k, output [k, record]
Reduce: do nothing!

    Map worker   Input records          Map output
    W1           5 e, 3 c, 6 f          3 c, 5 e | 6 f
    W2           2 b, 1 a, 6 f*         1 a | 2 b, 6 f*
    W3           8 h, 4 d, 9 i, 7 g     7 g, 9 i | 4 d, 8 h

    Reduce worker   Output
    W5              1 a, 3 c, 5 e, 7 g, 9 i
    W6              2 b, 4 d, 6 f, 6 f*, 8 h

One or two records for k=6?
(CS347 Notes 09)

Other Issues
- Skipping bad records
- Debugging

MR Claimed Advantages
- Model is easy to use; hides details of parallelization and fault recovery
- Many problems are expressible in the MR framework
- Scales to thousands of machines

MR Possible Disadvantages
- The 1-input, 2-stage data flow is rigid and hard to adapt to other scenarios
- Custom code needs to be written even for the most common operations, e.g., projection and filtering
- The opaque nature of the map and reduce functions impedes optimization
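The sorting example above can be replayed in a few lines of Python. This is a sketch under stated assumptions: the slide does not say how keys are assigned to the reduce workers, so the `partition` function below (odd keys to the first reduce worker, even keys to the second) is a guess chosen because it reproduces the W5 and W6 outputs shown; `map_record` and the variable names are likewise hypothetical.

```python
def map_record(record):
    """Map: extract k, output [k, record]."""
    key, _payload = record
    return key, record


def partition(key, num_reducers=2):
    # Assumption: odd keys go to reducer 0 (W5), even keys to reducer 1 (W6)
    return (key + 1) % num_reducers


inputs = {
    "W1": [(5, "e"), (3, "c"), (6, "f")],
    "W2": [(2, "b"), (1, "a"), (6, "f*")],
    "W3": [(8, "h"), (4, "d"), (9, "i"), (7, "g")],
}

reducers = [[], []]
for records in inputs.values():
    for rec in records:
        k, v = map_record(rec)
        reducers[partition(k)].append((k, v))

# The framework delivers pairs in key order; reduce itself does nothing.
outputs = [
    [rec for _k, rec in sorted(part, key=lambda kv: kv[0])]
    for part in reducers
]
print(outputs[0])   # plays the role of W5
print(outputs[1])   # plays the role of W6
```

Because the identity reduce passes every value through, both records with k=6 survive in this sketch; a reduce that kept a single value per key would drop one of them.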
Questions
- Can MR be made more declarative?
- How can we perform joins?
- How can we perform approximate grouping? Example: for all keys that are similar, reduce all values for those keys

Additional Topics
- Hadoop: open-source Map-Reduce system
- Pig: Yahoo system that builds on MR but is more declarative

Pig & Pig Latin
- A layer on top of map-reduce (Hadoop)
- Pig is the system
- Pig Latin is the query language
- Pig Latin is a hybrid between:
  - a high-level declarative query language in the spirit of SQL
  - low-level, procedural programming à la map-reduce

Example
Table urls: (url, category, pagerank)
Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category.
In SQL:
    SELECT category, AVG(pagerank)
    FROM urls WHERE pagerank > 0.2
    GROUP BY category HAVING COUNT(*) > 10^6

Example in Pig Latin
In Pig Latin:
    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;
    big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
    output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

good_urls = FILTER urls BY pagerank > 0.2;

    urls:
    url          category   pagerank
    z.cnn.com    .com       0.9
    y.yale.edu   .edu       0.5
    w.uc.edu     .edu       0.1
    x.nyt.com    .com       0.8
    y.ut.edu     .edu       0.6
    w.wh.gov     .gov       0.7

    good_urls:
    url          category   pagerank
    z.cnn.com    .com       0.9
    y.yale.edu   .edu       0.5
    x.nyt.com    .com       0.8
    y.ut.edu     .edu       0.6
    w.wh.gov     .gov       0.7
groups = GROUP good_urls BY category;

    groups:
    category   good_urls
    .com       {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
    .edu       {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}
    .gov       {(w.wh.gov, .gov, 0.7)}

big_groups = FILTER groups BY COUNT(good_urls) > 1;

    big_groups:
    category   good_urls
    .com       {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
    .edu       {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}

output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

    output:
    category   AVG(good_urls.pagerank)
    .com       0.85
    .edu       0.55

Features
- Similar to specifying a query execution plan (i.e., a dataflow graph), thereby making it easier for programmers to understand and control how their data processing task is executed.
- Support for a flexible, fully nested data model
- Extensive support for user-defined functions
- Ability to operate over plain input files without any schema information
- Novel debugging environment, useful when dealing with enormous data sets

Execution Control: Good or Bad?
Example:
    spam_urls = FILTER urls BY isspam(url);
    culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Should the system re-order the filters?

User Defined Functions
Example:
    groups = GROUP urls BY category;
    output = FOREACH groups GENERATE category, top10(urls);
Should the argument be groups.url?
UDF top10 can return a scalar or a set.

    groups:
    .gov   {(x.fbi.gov, .gov, 0.7) ...}
    .edu   {(y.yale.edu, .edu, 0.5) ...}
    .com   {(z.cnn.com, .com, 0.9) ...}

    output:
    .gov   {(fbi.gov) (cia.gov) ...}
    .edu   {(yale.edu) ...}
    .com   {(cnn.com) (ibm.com) ...}
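The four-step pagerank walk-through (FILTER, GROUP, FILTER, FOREACH) can be traced in plain Python to check the intermediate tables. This is only a sketch of the dataflow, not how Pig executes it; the Python variable names mirror the Pig Latin aliases, and the group-size threshold of 1 is the one used in the walk-through rather than the 10^6 of the original query.

```python
from collections import defaultdict

# The urls table from the example: (url, category, pagerank)
urls = [
    ("z.cnn.com", ".com", 0.9),
    ("y.yale.edu", ".edu", 0.5),
    ("w.uc.edu", ".edu", 0.1),
    ("x.nyt.com", ".com", 0.8),
    ("y.ut.edu", ".edu", 0.6),
    ("w.wh.gov", ".gov", 0.7),
]

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [u for u in urls if u[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for u in good_urls:
    groups[u[1]].append(u)

# big_groups = FILTER groups BY COUNT(good_urls) > 1;
big_groups = {cat: bag for cat, bag in groups.items() if len(bag) > 1}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output = {cat: sum(u[2] for u in bag) / len(bag) for cat, bag in big_groups.items()}

for cat, avg in output.items():
    print(cat, round(avg, 2))
```

Each statement consumes the previous alias and produces a new one, which is the "query execution plan" flavour listed under Features: the programmer, not an optimizer, fixes the order of the steps.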
Data Model
- Atom, e.g., 'alice'
- Tuple, e.g., ('alice', 'lakers')
- Bag, e.g., { ('alice', 'lakers') ('alice', ('ipod', 'apple')) }
- Map, e.g., [ 'fan of' -> { ('lakers') ('ipod') }, 'age' -> 20 ]
Note: Bags can currently only hold tuples. So {1, 2, 3} is stored as {(1) (2) (3)}.

Expressions in Pig Latin
[Table of expression types not recovered.]
- Should be (1) + (2)
- See flatten examples ahead

Specifying Input Data
    queries = LOAD 'query_log.txt'
              USING myload()
              AS (userid, queryString, timestamp);
- queries: handle for future use
- 'query_log.txt': input file
- myload(): custom deserializer
- AS (...): output schema

For Each
    expanded_queries = FOREACH queries GENERATE userid, expandQuery(queryString);
See example next slide.
Note that each tuple is processed independently; good for parallelism.
To remove one level of nesting:
    expanded_queries = FOREACH queries GENERATE userid, FLATTEN(expandQuery(queryString));

ForEach and Flattening
[Figure: example of expandQuery with and without FLATTEN.]
- "lakers rumors" is a single string value
- plus userid

Flattening Example (Fill In)

    X:
    A    B                           C
    a1   {(b1, b2) (b3, b4) (b5)}    {(c1) (c2)}
    a2   {(b6, (b7, b8))}            {(c3) (c4)}

Y = FOREACH X GENERATE A, FLATTEN(B), C
Flattening Example

    X:
    A    B                           C
    a1   {(b1, b2) (b3, b4) (b5)}    {(c1) (c2)}
    a2   {(b6, (b7, b8))}            {(c3) (c4)}

Y = FOREACH X GENERATE A, FLATTEN(B), C

    Y:
    a1   b1   b2         {(c1) (c2)}
    a1   b3   b4         {(c1) (c2)}
    a1   b5              {(c1) (c2)}
    a2   b6   (b7, b8)   {(c3) (c4)}

- Note the first tuple is (a1, b1, b2, {(c1)(c2)}).
- Flatten is not recursive.
- Note that attribute naming gets complicated. For example, $2 for the first tuple is b2; for the third tuple it is {(c1)(c2)}.

Flattening Example (Fill In)
Y = FOREACH X GENERATE A, FLATTEN(B), C
Z = FOREACH Y GENERATE A, B, FLATTEN(C)
Is Z = Z' where Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)?

Flattening Example

    Z:
    a1   b1   b2         c1
    a1   b1   b2         c2
    a1   b3   b4         c1
    a1   b3   b4         c2
    a1   b5              c1
    a1   b5              c2
    a2   b6   (b7, b8)   c3
    a2   b6   (b7, b8)   c4

Note that Z = Z' where Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)

Filter
    real_queries = FILTER queries BY userid neq 'bot';
    real_queries = FILTER queries BY NOT isbot(userid);
isbot is a UDF.

Co-Group
Two data sets, for example:
    results: (querystring, url, position)
    revenue: (querystring, adslot, amount)

    grouped_data = COGROUP results BY querystring, revenue BY querystring;
    url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue));
Co-Group is more flexible than SQL JOIN.

CoGroup vs Join
[Figure not recovered.]
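The flattening examples earlier in this section can be checked with a small Python model. This is a sketch of FLATTEN's one-level semantics as the slides describe them, not Pig's implementation: rows are Python tuples, bags are lists of tuples, and the helper name `flatten` and its positional-index argument are assumptions made for illustration.

```python
from itertools import product


def flatten(rows, idx):
    """FLATTEN the bag in field idx by one level: one output row per bag tuple."""
    out = []
    for row in rows:
        i = idx % len(row)                  # allows idx=-1 for "the last field"
        for tup in row[i]:
            out.append(row[:i] + tuple(tup) + row[i + 1:])
    return out


X = [
    ("a1", [("b1", "b2"), ("b3", "b4"), ("b5",)], [("c1",), ("c2",)]),
    ("a2", [("b6", ("b7", "b8"))], [("c3",), ("c4",)]),
]

# Y = FOREACH X GENERATE A, FLATTEN(B), C
Y = flatten(X, 1)

# Z = FOREACH Y GENERATE A, B, FLATTEN(C)   (C is the last field of every Y row)
Z = flatten(Y, -1)

# Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C): cross product of the two bags
Z_prime = [(a,) + tuple(b) + tuple(c) for a, B, C in X for b, c in product(B, C)]

for row in Z:
    print(row)
```

The inner tuple (b7, b8) stays nested because only one level is removed, and the field at position 2 is b2 in the first row of Y but the bag {(c1)(c2)} in the third, which is the attribute-naming complication the slide points out.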
Group (Simple CoGroup)
    grouped_revenue = GROUP revenue BY querystring;
    query_revenues = FOREACH grouped_revenue GENERATE querystring, SUM(revenue.amount) AS totalrevenue;

CoGroup Example 1
Z1 = GROUP X BY A

    A   X
    1   {(1, 1, c1) (1, 1, c2)}
    2   {(2, 2, c3) (2, 2, c4)}

CoGroup Example 2
Z2 = GROUP X BY (A, B)
Syntax not in paper but being added.

    A/B?     X
    (1, 1)   {(1, 1, c1) (1, 1, c2)}
    (2, 2)   {(2, 2, c3) (2, 2, c4)}

CoGroup Example 3
Z3 = COGROUP X BY A, Y BY A
The result has columns A, X, Y.
CoGroup Example 3
Z3 = COGROUP X BY A, Y BY A

    A   X                         Y
    1   {(1, 1, c1) (1, 1, c2)}   {(1, 1, d1) (1, 2, d2)}
    2   {(2, 2, c3) (2, 2, c4)}   {(2, 1, d3) (2, 2, d4)}

CoGroup Example 4
Z4 = COGROUP X BY A, Y BY B

    A   X                         Y
    1   {(1, 1, c1) (1, 1, c2)}   {(1, 1, d1) (2, 1, d3)}
    2   {(2, 2, c3) (2, 2, c4)}   {(1, 2, d2) (2, 2, d4)}

CoGroup With Function Call?

    X:
    A       B
    (2,2)   a
    (1,3)   b
    (2,2)   c

Y = GROUP X BY A

    A       X
    (2,2)   {((2,2), a) ((2,2), c)}
    (1,3)   {((1,3), b)}

Z = GROUP X BY SUM(A)
SUM adds the integers in the tuple.

    SUM(A)/A?   X
    4           {((2,2), a) ((1,3), b) ((2,2), c)}

Pig Latin Join
    join_result = JOIN results BY querystring, revenue BY querystring;
Shorthand for:
    temp_var = COGROUP results BY querystring, revenue BY querystring;
    join_result = FOREACH temp_var GENERATE FLATTEN(results), FLATTEN(revenue);
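CoGroup Examples 3 and 4 and the JOIN shorthand can be reproduced with a short Python model. This is an illustrative sketch rather than Pig's implementation: `cogroup` and `join` are hypothetical helpers, keys are passed as Python functions, and the X and Y tuples are the ones that appear inside the example bags.

```python
from collections import defaultdict
from itertools import product


def cogroup(x, x_key, y, y_key):
    """COGROUP x BY x_key, y BY y_key: one row (key, bag of x, bag of y) per key."""
    groups = defaultdict(lambda: ([], []))
    for t in x:
        groups[x_key(t)][0].append(t)
    for t in y:
        groups[y_key(t)][1].append(t)
    return [(k, xs, ys) for k, (xs, ys) in sorted(groups.items())]


def join(x, x_key, y, y_key):
    """JOIN as COGROUP followed by flattening both bags (a per-key cross product)."""
    return [
        tx + ty
        for _k, xs, ys in cogroup(x, x_key, y, y_key)
        for tx, ty in product(xs, ys)
    ]


X = [(1, 1, "c1"), (1, 1, "c2"), (2, 2, "c3"), (2, 2, "c4")]
Y = [(1, 1, "d1"), (1, 2, "d2"), (2, 1, "d3"), (2, 2, "d4")]

Z3 = cogroup(X, lambda t: t[0], Y, lambda t: t[0])   # COGROUP X BY A, Y BY A
Z4 = cogroup(X, lambda t: t[0], Y, lambda t: t[1])   # COGROUP X BY A, Y BY B
J = join(X, lambda t: t[0], Y, lambda t: t[0])       # JOIN X BY A, Y BY A

for row in Z4:
    print(row)
```

Keeping the two bags separate is what makes COGROUP more flexible than a join: a function such as distributeRevenue can inspect both bags for a key before anything is paired up, whereas JOIN commits to the cross product immediately.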
MapReduce in Pig Latin
    map_result = FOREACH input GENERATE FLATTEN(map(*));
    key_groups = GROUP map_result BY $0;
    output = FOREACH key_groups GENERATE reduce(*);
- $0: the key is the first attribute
- *: all attributes

Store
To materialize the result in a file:
    STORE query_revenues INTO 'myoutput' USING mystore();
- 'myoutput': output file
- mystore(): custom serializer

Hadoop
- HDFS: Hadoop file system
- How to use Hadoop, examples
CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2015 Lecture 11 Parallel DBMSs and MapReduce References Parallel Database Systems: The Future of High Performance Database
More informationIntroduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data
Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction
More informationCompSci 516: Database Systems
CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and
More informationBigtable. Presenter: Yijun Hou, Yixiao Peng
Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 06 Presenter: Yijun Hou, Yixiao Peng
More informationCIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu
CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin Presented by: Suhua Wei Yong Yu Papers: MapReduce: Simplified Data Processing on Large Clusters 1 --Jeffrey Dean
More informationHadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)
Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:
More informationTI2736-B Big Data Processing. Claudia Hauff
TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement
More informationIntroduction to MapReduce
Introduction to MapReduce Jacqueline Chame CS503 Spring 2014 Slides based on: MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI 2004. MapReduce: The Programming
More informationPage 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24
Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony
More informationCSE-E5430 Scalable Cloud Computing Lecture 9
CSE-E5430 Scalable Cloud Computing Lecture 9 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 15.11-2015 1/24 BigTable Described in the paper: Fay
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationHadoop & Big Data Analytics Complete Practical & Real-time Training
An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE
More informationPig Latin Reference Manual 1
Table of contents 1 Overview.2 2 Pig Latin Statements. 2 3 Multi-Query Execution 5 4 Specialized Joins..10 5 Optimization Rules. 13 6 Memory Management15 7 Zebra Integration..15 1. Overview Use this manual
More informationLecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!
Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm
More informationDistributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015
Distributed Systems 18. MapReduce Paul Krzyzanowski Rutgers University Fall 2015 November 21, 2016 2014-2016 Paul Krzyzanowski 1 Credit Much of this information is from Google: Google Code University [no
More informationBig Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture
Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem
More informationdata parallelism Chris Olston Yahoo! Research
data parallelism Chris Olston Yahoo! Research set-oriented computation data management operations tend to be set-oriented, e.g.: apply f() to each member of a set compute intersection of two sets easy
More informationBig Data Analytics using Apache Hadoop and Spark with Scala
Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important
More informationApril Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.
1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map
More informationCS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab
CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material
More informationMapReduce-style data processing
MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic
More informationAnnouncements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems
Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationDistributed Computation Models
Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case
More information