Databases 2 (VU) ( / )

1 Databases 2 (VU) ( / ) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, / 42

2 Outline
1 Problems Suited for Map-Reduce
2 MapReduce: Applications
  - Matrix-Vector Multiplication
  - Information Retrieval
3 Hadoop Ecosystem
  - Big Data Storage Technologies

Slides are partially based on:
- Slides "Mining Massive Datasets" by Jure Leskovec
- Slides "MapReduce Runtime Environments" by Gilles Fedak
- Slides "Tutorial: MapReduce Theory and Practice of Data-intensive Applications" by Pietro Michiardi
- "MapReduce: Simplified Data Processing on Large Clusters" by Dean et al.
- "The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce" by J. Lin
- "Limitations and Challenges of HDFS and MapReduce" by Weets et al.

5 Problems Suited for Map-Reduce When MapReduce is a suitable choice
- analyzing textual data; from our examples: computing statistics on words, ...
- in general, when files are large and are rarely updated in place
- not suitable when managing online sales
  - for instance, the principal operations on Amazon data involve responding to searches for products, recording sales, and so on; processes that involve relatively little calculation and that change the database
- instead, use MapReduce for analytical queries on data generated, e.g., by a Web application
  - to find users with similar buying patterns or to rank search results (PageRank), ...
- not suitable for handling Web requests (even if we have millions of users)

9 Problems Suited for Map-Reduce When MapReduce is not a suitable choice
- it is not always easy to formulate and implement everything as an MR program (= parallelizable according to the divide & conquer idea)
- when your processing requires a lot of data to be shuffled over the network
  - remember, the idea is to bring the algorithm to the data
- real-time processing, when fast responses are needed for complex algorithms
  - for example, machine learning algorithms such as SVMs
- processing graphs
  - use frameworks such as Giraph

12 Problems Suited for Map-Reduce When MapReduce is not a suitable choice
- when a lot of data needs to be sorted / shuffled, e.g. the map phase generates too many keys
  - shuffling is very time-consuming, often overloading the network
  - the reason is that the reducers fetch all intermediate data at once as soon as the last mapper finishes
  - this led to strategies such as virtual shuffling or predictive scheduling
- when iterations are needed
  - for example, clustering algorithms such as K-Means
  - use frameworks such as Spark
- handling streaming data
  - MR is best suited to batch processing huge amounts of data
  - use frameworks such as Storm

15 MapReduce: Applications MapReduce: Typical Applications
- computations such as analytic queries typically involve matrix operations
  - the original purpose of the MapReduce implementation was to execute large matrix-vector multiplications to calculate the PageRank
  - these matrix operations, such as matrix-matrix and matrix-vector multiplications, fit nicely into the MapReduce programming model
- another important class of operations that can use MapReduce effectively are relational-algebra operations
  - many operations on data can be described easily in terms of the common database-query primitives such as selection, union, intersection, joins
  - these can be used for social network analysis, e.g. finding paths of different lengths, counting friends, etc.

16 MapReduce: Applications Application 1: Matrix-Vector Multiplication
What do we need to do?
Suppose we have an $n \times n$ matrix $M$, whose element in row $i$ and column $j$ will be denoted $m_{ij}$. Suppose we also have a vector $v$ of length $n$, whose $j$th element is $v_j$. Then the matrix-vector product is the vector $x$ of length $n$, whose $i$th element is given by

$$x_i = \sum_{j=1}^{n} m_{ij} v_j$$

Outline a Map-Reduce program that calculates the vector $x$.

20 MapReduce: Applications Matrix-Vector Multiplication Matrix-Vector Multiplication
- let us first assume that the vector v is large, but that it can still fit into memory
- the matrix M and the vector v will each be stored in a file of the DFS
- assume that the row-column coordinates (indices) of a matrix element can be discovered
  - for example, each value is stored as a triple $(i, j, m_{ij})$
  - the position of $v_j$ can be discovered analogously

24 MapReduce: Applications Matrix-Vector Multiplication Matrix-Vector Multiplication
So what does the Map function look like?
- a map function applies to one element of the matrix M
- the vector v is first read in its entirety and is available to all Map tasks at that compute node
- from each matrix element $m_{ij}$ the map function produces the key-value pair $(i, m_{ij} v_j)$
- all terms of the sum that make up the component $x_i$ of the matrix-vector product will get the same key i

27 MapReduce: Applications Matrix-Vector Multiplication Matrix-Vector Multiplication
What about the Reduce Function?
- the reduce function sums all the values associated with a given key i
- the result is a pair $(i, x_i)$
- thereby calculating one entry in the output vector x
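
The Map and Reduce functions for the matrix-vector product can be sketched in a few lines of Python. This is only a single-process illustration: the names `map_element`, `reduce_row` and `run_job` are hypothetical, the indices are 0-based, and the dictionary-based grouping merely stands in for the shuffle that a real MapReduce framework would perform.

```python
from collections import defaultdict


def map_element(i, j, m_ij, v):
    """Map: one matrix element (i, j, m_ij) yields the pair (i, m_ij * v_j)."""
    # Assumes the whole vector v fits into memory at the compute node.
    yield i, m_ij * v[j]


def reduce_row(i, values):
    """Reduce: sum all values that share the key i, giving (i, x_i)."""
    return i, sum(values)


def run_job(triples, v):
    """Toy driver: group map output by key, then reduce each group."""
    groups = defaultdict(list)
    for i, j, m_ij in triples:
        for key, value in map_element(i, j, m_ij, v):
            groups[key].append(value)
    return dict(reduce_row(i, vals) for i, vals in sorted(groups.items()))


# M = [[1, 2], [3, 4]] stored as (i, j, m_ij) triples, v = [10, 20]
M_TRIPLES = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
V = [10, 20]
X = run_job(M_TRIPLES, V)
print(X)  # {0: 50, 1: 110}
```

Each reduce call produces exactly one entry of x, so the reducers are independent of each other and can run in parallel.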

29 MapReduce: Applications Matrix-Vector Multiplication Matrix-Vector Multiplication: Divide&Conquer
- however, it might be that the vector v does not fit into main memory
  - it is not required that the vector v fits into the memory at a compute node, but if it does not, there will be a very large number of disk accesses as we move pieces of the vector into main memory
- alternatively, we can divide the matrix M into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes of the same height
  - use enough stripes so that the portion of the vector in one stripe can fit into main memory at a compute node

30 MapReduce: Applications Matrix-Vector Multiplication Matrix-Vector Multiplication: Divide&Conquer Figure: Divide matrix M and vector v into stripes

33 MapReduce: Applications Matrix-Vector Multiplication Matrix-Vector Multiplication: Divide&Conquer
- the ith stripe of matrix M multiplies only components from the ith stripe of the vector
- we can divide matrix M into one file for each stripe, and do the same for the vector v
- each Map task is assigned a chunk from one of the stripes in the matrix and gets the entire corresponding stripe of the vector
- Map and Reduce tasks can then act exactly as before
- the results of the stripe multiplications need to be summed up once more
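
A rough Python sketch of the striped variant follows. It assumes 0-based indices and a hypothetical parameter `width` for the stripe width; the function name `striped_matvec` is likewise illustrative. The point is that each stripe of M only ever touches the matching slice of v, and that the per-stripe partial results are summed once more at the end.

```python
from collections import defaultdict


def striped_matvec(triples, v, width):
    """Multiply M (given as (i, j, m_ij) triples) by v, one vertical stripe at a time."""
    # Split the matrix into vertical stripes: column j belongs to stripe j // width.
    matrix_stripes = defaultdict(list)
    for i, j, m_ij in triples:
        matrix_stripes[j // width].append((i, j, m_ij))

    partial = defaultdict(list)  # row i -> list of per-stripe partial sums
    for s, elements in matrix_stripes.items():
        # Only this slice of v has to be in memory while stripe s is processed.
        v_stripe = v[s * width:(s + 1) * width]
        sums = defaultdict(int)
        for i, j, m_ij in elements:
            sums[i] += m_ij * v_stripe[j - s * width]
        for i, value in sums.items():
            partial[i].append(value)

    # Final step: sum the partial results of all stripes for each row.
    return {i: sum(values) for i, values in sorted(partial.items())}


M_TRIPLES = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
V = [10, 20]
X_STRIPED = striped_matvec(M_TRIPLES, V, width=1)
print(X_STRIPED)  # {0: 50, 1: 110}
```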

35 MapReduce: Applications Information Retrieval Application 2: Information Retrieval
Information Retrieval (IR) in a Nutshell
- is the activity of obtaining information resources which are relevant to an information need (for instance a search query you submit to Google) from a collection of information resources (for instance databases of texts, images or sounds)
- Information Retrieval is, so to speak, the science of searching for information

36 MapReduce: Applications Information Retrieval Information Retrieval Process Figure: taken from Eissa Alshari Semantic Arabic Information Retrieval Framework

37 MapReduce: Applications Information Retrieval Search Process - Step 1: Indexing Figure: taken from

38 MapReduce: Applications Information Retrieval Search Process - Step 2: Retrieval

40 MapReduce: Applications Information Retrieval Are Indexing or Retrieval MapReducable?
The Indexing Problem
- scalability is critical
- must be relatively fast, but need not be real time
- fundamentally a batch operation
- sounds like a Map-Reduce problem
The Retrieval Problem
- must have sub-second response time
- for the web, only relatively few results are needed
- rather not a Map-Reduce problem

43 MapReduce: Applications Information Retrieval Index Construction with MapReduce
Map over all documents
- emit term as key, (docnr., term frequency) as value
- emit other information as necessary (e.g., term position)
Sort/shuffle: group postings by term
Reduce
- gather and sort the postings (e.g., by docnr. or tf)
- write postings to disk
MapReduce does all the heavy lifting
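
The three steps of index construction can be sketched as follows. This is a minimal single-process illustration, not a Hadoop job: the function names and the whitespace tokenizer are assumptions, and the in-memory grouping stands in for the framework's sort/shuffle phase.

```python
from collections import Counter, defaultdict


def map_document(doc_nr, text):
    """Map: emit (term, (doc_nr, term_frequency)) for every distinct term."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_nr, tf)


def reduce_postings(term, postings):
    """Reduce: gather the postings of one term and sort them by doc_nr."""
    return term, sorted(postings)


def build_index(docs):
    """Toy driver: group postings by term (sort/shuffle), then reduce."""
    groups = defaultdict(list)
    for doc_nr, text in docs.items():
        for term, posting in map_document(doc_nr, text):
            groups[term].append(posting)
    return dict(reduce_postings(t, p) for t, p in sorted(groups.items()))


DOCS = {1: "one fish two fish", 2: "red fish blue fish", 3: "one red bird"}
INDEX = build_index(DOCS)
print(INDEX["fish"])  # [(1, 2), (2, 2)]
```

Note that `reduce_postings` receives the complete posting list of a term before sorting it, which is where memory pressure can arise for very frequent terms.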

44 MapReduce: Applications Information Retrieval Index Construction with MapReduce: An Overview

45 MapReduce: Applications Information Retrieval including information on term position:

48 MapReduce: Applications Information Retrieval
- however, this initial implementation can run into scalability problems
  - if terms (e.g. fish) are stored as keys, the reducers must buffer all postings associated with the key fish
  - what if we run out of memory to buffer the postings?
- so instead: let the framework do the sorting; the term frequency is implicitly stored; postings are written directly to disk
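
One common way to let the framework do the sorting is to move the document number into the key, so that pairs arrive at the reducer already ordered by (term, docnr.). The sketch below assumes this reading of the slide; it also assumes that the value is the list of term positions, so that the term frequency is implicit as the length of that list. All names are hypothetical, and the call to `sort` stands in for the framework's sort phase.

```python
def map_positions(doc_nr, text):
    """Map: emit ((term, doc_nr), positions); tf is implicit as len(positions)."""
    positions = {}
    for pos, term in enumerate(text.lower().split()):
        positions.setdefault(term, []).append(pos)
    for term, pos_list in positions.items():
        yield (term, doc_nr), pos_list


def run_sorted_job(docs):
    """Toy driver: sort on the composite key, then append postings one by one."""
    pairs = [kv for doc_nr, text in docs.items()
             for kv in map_positions(doc_nr, text)]
    pairs.sort(key=lambda kv: kv[0])  # stands in for the framework's sort
    index = {}
    for (term, doc_nr), pos_list in pairs:
        # Postings arrive in order, so each one can be written out immediately
        # instead of being buffered and sorted inside the reducer.
        index.setdefault(term, []).append((doc_nr, len(pos_list), pos_list))
    return index


DOCS_POS = {1: "one fish two fish", 2: "red fish blue fish"}
INDEX_POS = run_sorted_job(DOCS_POS)
print(INDEX_POS["fish"])  # [(1, 2, [1, 3]), (2, 2, [1, 3])]
```

In a real cluster this design would also need a partitioner that routes keys by term only, so that all postings of one term still reach the same reducer.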

52 Hadoop Ecosystem Hadoop Eco System (v1)
HBase
- open source, non-relational, distributed database modeled after Google's BigTable and written in Java
- provides a fault-tolerant way of storing large quantities of sparse data
Hive
- data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
- a data warehouse is a system used for reporting and data analysis
Pig
- high-level platform for creating programs that run on Apache Hadoop (the language is called Pig Latin)
- abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level

56 Hadoop Ecosystem Hadoop Eco System (v2)
Yarn (Yet Another Resource Negotiator)
- is a cluster management system to run Big Data applications on a cluster; not a data processing platform itself, but enables the platforms to run their code in a cluster environment
Spark, Flink, Storm
- are frameworks for cluster computing specializing in either batch processing, stream processing or both (see next slide)
Giraph
- utilizes Apache Hadoop's MapReduce implementation to process graphs
Impala
- is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop

57 Hadoop Ecosystem Data Processing: From Batch to Streaming...

58 Hadoop Ecosystem Java(-ish) is the Hadoop Language

61 Hadoop Ecosystem The Good, the Bad and the Ugly wrt. the Ecosystem
The Good
- Easy to write parallel, highly scalable applications
- Stream and Batch processing
- Seamless integration with other systems (e.g. RDBMS)
The Bad
- Rapid development, hard to keep an overview
- 150 projects in (or near) the Hadoop Eco System, see
The Ugly
- Maven dependency hell, if integrated with other systems
- Spark depends on > 50 libraries with a specific version!

62 Hadoop Ecosystem History of Hadoop

63 Hadoop Ecosystem System Design usually boils down to picking the right components wrt. Input, Processing + Output

64 Hadoop Ecosystem System Design

67 Hadoop Ecosystem Big Data Storage Technologies Big Data Storage Technologies
File-based: HDFS
- distributed, permanent file storage
- tuned for large files
- no indexing
Key-Value based: HBase
- distributed Key/Value store
- fast look-ups
- based on HDFS
Message-based: Kafka
- distributed Producer/Consumer messaging system
- data partitioned in topics
- producer groups / consumer groups

68 Hadoop Ecosystem Big Data Storage Technologies File-based: HDFS

70 Hadoop Ecosystem Big Data Storage Technologies Key-Value based: HBase

72 Hadoop Ecosystem Big Data Storage Technologies Message-based: Kafka

74 Hadoop Ecosystem Big Data Storage Technologies Map Reduce Lectures Recap
- Part 1: Handling Big Data
  - Key Elements: MapReduce Framework & Distributed File System
- Part 2: Optimization: Maximizing Parallelism
  - Stragglers Problem & Input Data Skew
- Part 3: Suitability & Applications
  - Hadoop Ecosystem & Storage Technologies

75 Hadoop Ecosystem Big Data Storage Technologies The End
Next: Beyond Map-Reduce


More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

Lecture Map-Reduce. Algorithms. By Marina Barsky Winter 2017, University of Toronto

Lecture Map-Reduce. Algorithms. By Marina Barsky Winter 2017, University of Toronto Lecture 04.02 Map-Reduce Algorithms By Marina Barsky Winter 2017, University of Toronto Example 1: Language Model Statistical machine translation: Need to count number of times every 5-word sequence occurs

More information

MapReduce Algorithms

MapReduce Algorithms Large-scale data processing on the Cloud Lecture 3 MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation Commons Attribution 3.0 License) Outline

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

A Glimpse of the Hadoop Echosystem

A Glimpse of the Hadoop Echosystem A Glimpse of the Hadoop Echosystem 1 Hadoop Echosystem A cluster is shared among several users in an organization Different services HDFS and MapReduce provide the lower layers of the infrastructures Other

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours) Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

Apache Hive for Oracle DBAs. Luís Marques

Apache Hive for Oracle DBAs. Luís Marques Apache Hive for Oracle DBAs Luís Marques About me Oracle ACE Alumnus Long time open source supporter Founder of Redglue (www.redglue.eu) works for @redgluept as Lead Data Architect @drune After this talk,

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

Shark: Hive (SQL) on Spark

Shark: Hive (SQL) on Spark Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Hadoop, Yarn and Beyond

Hadoop, Yarn and Beyond Hadoop, Yarn and Beyond 1 B. R A M A M U R T H Y Overview We learned about Hadoop1.x or the core. Just like Java evolved, Java core, Java 1.X, Java 2.. So on, software and systems evolve, naturally.. Lets

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement

More information

MapReduce and Friends

MapReduce and Friends MapReduce and Friends Craig C. Douglas University of Wyoming with thanks to Mookwon Seo Why was it invented? MapReduce is a mergesort for large distributed memory computers. It was the basis for a web

More information

Distributed Computing with Spark and MapReduce

Distributed Computing with Spark and MapReduce Distributed Computing with Spark and MapReduce Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU

Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU !2 MapReduce Overview! Sometimes a single computer cannot process data or takes too long traditional serial programming is not always

More information

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1 MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.

More information

Vendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo

Vendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo Vendor: Cloudera Exam Code: CCD-410 Exam Name: Cloudera Certified Developer for Apache Hadoop Version: Demo QUESTION 1 When is the earliest point at which the reduce method of a given Reducer can be called?

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)! Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm

More information

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem

More information

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. 1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map

More information

Hadoop Development Introduction

Hadoop Development Introduction Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

exam.   Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0 70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to

More information

Oracle Big Data Fundamentals Ed 2

Oracle Big Data Fundamentals Ed 2 Oracle University Contact Us: 1.800.529.0165 Oracle Big Data Fundamentals Ed 2 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, you learn about big data, the technologies

More information

Configuring and Deploying Hadoop Cluster Deployment Templates

Configuring and Deploying Hadoop Cluster Deployment Templates Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Apache Flink Big Data Stream Processing

Apache Flink Big Data Stream Processing Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017

More information

MapReduce: Algorithm Design for Relational Operations

MapReduce: Algorithm Design for Relational Operations MapReduce: Algorithm Design for Relational Operations Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec Projection π Projection in MapReduce Easy Map over tuples, emit

More information

Distributed Computation Models

Distributed Computation Models Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case

More information

Expert Lecture plan proposal Hadoop& itsapplication

Expert Lecture plan proposal Hadoop& itsapplication Expert Lecture plan proposal Hadoop& itsapplication STARTING UP WITH BIG Introduction to BIG Data Use cases of Big Data The Big data core components Knowing the requirements, knowledge on Analyst job profile

More information

Part A: MapReduce. Introduction Model Implementation issues

Part A: MapReduce. Introduction Model Implementation issues Part A: Massive Parallelism li with MapReduce Introduction Model Implementation issues Acknowledgements Map-Reduce The material is largely based on material from the Stanford cources CS246, CS345A and

More information

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI Department of Information Technology IT6701 - INFORMATION MANAGEMENT Anna University 2 & 16 Mark Questions & Answers Year / Semester: IV / VII Regulation: 2013

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

Practical Big Data Processing An Overview of Apache Flink

Practical Big Data Processing An Overview of Apache Flink Practical Big Data Processing An Overview of Apache Flink Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de With slides from Volker Markl and data artisans 1 2013

More information

Hadoop Online Training

Hadoop Online Training Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the

More information