MIT805 BIG DATA MAPREDUCE



1 MIT805 BIG DATA MAPREDUCE
Christoph Stallmann
Department of Computer Science, University of Pretoria

2 Admin
Part 2 & 3 of the assignment
Team registrations

3 Concept Roman Empire

5 Concept Roman Empire
1. Send out a messenger from Rome to each province's capital
2. Each capital sends out messengers to cities and villages
3. Each city and village counts its citizens
4. A single number is returned with the messenger to the capital
5. Each capital sums up the counts for all its cities/villages
6. A messenger returns a single count for each province to Rome
7. In Rome the total count is calculated
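The census steps above can be sketched in a few lines of Python. This is only an illustration: the province names, settlement names, and citizen counts are made up, and the helper names count_province and count_empire are hypothetical.

```python
# Hypothetical census data: province -> settlement -> citizen count.
empire = {
    "Italia": {"Rome": 1000, "Ostia": 50},
    "Gallia": {"Lugdunum": 200, "Massilia": 120},
    "Aegyptus": {"Alexandria": 500},
}

def count_province(settlements):
    # Steps 3-5: every settlement reports one number, the capital sums them.
    return sum(settlements.values())

def count_empire(provinces):
    # Steps 6-7: every province reports one number, Rome sums them.
    province_totals = {name: count_province(s) for name, s in provinces.items()}
    return province_totals, sum(province_totals.values())

province_totals, total = count_empire(empire)
print(province_totals)
print(total)
```

The point of the analogy is that only a single number travels back along each route, never the full list of citizens.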

6 Concept Sandwich Video

7 MapReduce Overview
MapReduce is a programming model for processing big datasets in a distributed environment

8 MapReduce History
Concepts of map and reduce have been around since the 1980s
  Similar to the reduce and scatter functions of MPI
  Similar to the map and reduce functions in functional programming
MapReduce originally referred to Google's proprietary software
In 2014 Google changed their big data processing from MapReduce to Cloud Dataflow

9 MapReduce Languages
Most functional programming languages have built-in map and reduce functions
  R
  Python
  Scala
Extensions and libraries now provide MapReduce in many other languages
  Java
  C++
  Scheme
  Many more
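As a small illustration of the built-in functions mentioned above, Python offers map() as a built-in and reduce() in the standard functools module. The list of levels below is made-up sample data.

```python
from functools import reduce

levels = [4, 5, 6, 5]  # made-up sample values

# map applies a function to every element independently.
doubled = list(map(lambda x: x * 2, levels))

# reduce folds the whole list into one value with a two-argument function.
highest = reduce(max, levels)
total = reduce(lambda a, b: a + b, levels)

print(doubled, highest, total)
```

These functions run on a single machine; a MapReduce system applies the same idea across many nodes.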

10 MapReduce Systems
Apache Hadoop (free, open-source, most widely used)
Apache Spark (free, open-source, difference to Hadoop?)
Apache HBase (free, open-source, difference to Hadoop?)
Apache Hive (free, open-source, difference to Hadoop?)
Apache Pig (free, open-source, difference to Hadoop?)
Apache Tez (free, open-source, difference to Hadoop?)
Apache Storm (free, open-source, difference to Hadoop?)
Apache Apex (free, open-source, difference to Hadoop?)

11 Other Non-Apache Big Data Systems
Ceph
DataTorrent RTS
Disco
Google BigQuery
High-Performance Computing Cluster (HPCC)
Hydra
Pachyderm
Presto

12 MapReduce Responsibilities
User/programmer responsibilities:
  Provide input data
  Code a map() function
  Code a reduce() function

13 MapReduce Responsibilities
MapReduce system responsibilities:
  Distribution and workload management
  Communication and networking
  Execution of code

14 MapReduce Functions
MapReduce has two main functions:
  map()
  reduce()
Two additional functions are called implicitly:
  split()
  shuffle()

15 Split Function
If the data is not already in smaller subsets, the split function divides the dataset into subsets
Needed to distribute a large dataset across multiple nodes
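A minimal sketch of splitting, assuming the dataset is an in-memory list and that contiguous, nearly equal chunks are acceptable. The function name split and its parameters are illustrative only; a real system splits files on disk rather than lists.

```python
def split(records, num_chunks):
    """Divide records into num_chunks contiguous, nearly equal subsets."""
    chunk_size, remainder = divmod(len(records), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `remainder` chunks receive one extra record each.
        end = start + chunk_size + (1 if i < remainder else 0)
        chunks.append(records[start:end])
        start = end
    return chunks

chunks = split(list(range(10)), 3)
print(chunks)
```

Each chunk could then be handed to a different node for mapping.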

16 Map Function
Map takes key-value pairs from one domain and returns a list of pairs in another domain
map(key1, val1) -> list(key2, val2)
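To make the signature concrete, here is a sketch of a map function for the classic word-count task (not an example from the slides). It assumes key1 is a line number and val1 is the text of that line; the name map_words is hypothetical.

```python
def map_words(key, value):
    """key: line number (unused here), value: line text.

    Returns a list of (word, 1) pairs, i.e. list(key2, val2).
    """
    return [(word.lower(), 1) for word in value.split()]

pairs = map_words(0, "Big data big clusters")
print(pairs)
```

The input domain (line number, text) differs from the output domain (word, count), which is what the signature above expresses.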

17 Shuffle Function
Shuffle takes the map output pairs and groups them according to their key (key2)
Shuffle then forwards each group to a reducer
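The grouping and forwarding can be sketched as follows. This assumes all map output fits in memory and that a reducer is chosen by hashing the key, which is one common approach but not necessarily what a given system does. The names shuffle and pick_reducer are hypothetical.

```python
import zlib
from collections import defaultdict

def shuffle(pairs):
    """Group map output pairs by key: (key2, val2)... -> {key2: [val2, ...]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def pick_reducer(key, num_reducers):
    """Choose a reducer index for a key using a deterministic hash (assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

pairs = [("capetown", 5), ("pretoria", 4), ("capetown", 6), ("durban", 1)]
groups = shuffle(pairs)
assignment = {key: pick_reducer(key, 2) for key in groups}
print(groups)
print(assignment)
```

Because the reducer is derived from the key alone, all values for one key end up at the same reducer.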

18 Reduce Function
Reduce takes the pair groups from the shuffle function and applies summarization to them according to their key
reduce(key2, list(val2)) -> list(val3)
The reduce function typically returns a single value
However, multiple outputs are possible
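Two hypothetical reduce functions illustrate the single-value and multiple-output cases. Both follow the signature above by returning a list; the names reduce_max and reduce_min_max are illustrative only.

```python
def reduce_max(key, values):
    """Typical case: summarize a group into a single value."""
    return [max(values)]

def reduce_min_max(key, values):
    """Multiple outputs: return both the lowest and the highest value."""
    return [min(values), max(values)]

single = reduce_max("capetown", [4, 5, 6, 5])
multiple = reduce_min_max("capetown", [4, 5, 6, 5])
print(single, multiple)
```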

19 MapReduce Architecture

20 Example Water Restrictions
Input: A bunch of files with random water restriction levels for each city
Output: The maximum water restriction level for each city in 2018

21 Example Water Restrictions (Splitting)
2018 Water Restriction Database -> split() -> Jan, Feb, ..., Dec

22 Example Water Restrictions January February March April May June July August September October November December

23 Example Water Restrictions January Date City Level Cape Town Pretoria Johannesburg Durban Pretoria Cape Town Durban Cape Town 5

24 Example Water Restrictions (Mapping)
Jan -> map() -> <capetown, 5> <pretoria, 4> <johannesburg, 3> <durban, 1>
Feb -> map() -> <capetown, 5> <pretoria, 4> <johannesburg, 3> <durban, 1>
Dec -> map() -> <capetown, 5> <pretoria, 4> <johannesburg, 3> <durban, 1>
Input data contains multiple entries for each city
Output provides a single map with one <key, value> pair per city

25 Example Water Restrictions (Shuffling) <capetown, 5> <pretoria, 4> <johannesburg, 3> <durban, 1> Shuffle <capetown, 4> <capetown, 5> <capetown, 6> <capetown, 5> <capetown, 4> <pretoria, 3> <johannesburg, 3> <durban, 2> <pretoria, 4> <pretoria, 2> <pretoria, 3> <pretoria, 2> <capetown, 5> <pretoria, 4> <johannesburg, 3> <durban, 1> <durban, 2> <durban, 1> <durban, 3> <durban, 1>

26 Example Water Restrictions (Reducing)
<capetown, 4> <capetown, 5> <capetown, 6> <capetown, 5> -> reduce() -> <capetown, 6>
<pretoria, 4> <pretoria, 2> <pretoria, 3> <pretoria, 2> -> reduce() -> <pretoria, 4>
<capetown, 6> <pretoria, 4> <johannesburg, 3> <durban, 2>
<durban, 2> <durban, 1> <durban, 3> <durban, 1> -> reduce() -> <durban, 3>
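The whole water restriction example can be sketched end to end. The monthly records below are invented for illustration (the original table values are not available), only three months are shown, and the function names are hypothetical. Following the mapping slide, the map step emits one pair per city per month.

```python
from collections import defaultdict

# Made-up monthly files, already split by month: (city, level) records.
monthly_files = {
    "Jan": [("capetown", 4), ("pretoria", 4), ("durban", 2), ("capetown", 3)],
    "Feb": [("capetown", 5), ("pretoria", 2), ("johannesburg", 3), ("durban", 1)],
    "Mar": [("capetown", 6), ("pretoria", 3), ("durban", 3), ("durban", 2)],
}

def map_month(month, records):
    """Emit one (city, highest level in this month) pair per city."""
    best = {}
    for city, level in records:
        best[city] = max(level, best.get(city, level))
    return list(best.items())

def shuffle(pairs):
    """Group all monthly pairs by city."""
    groups = defaultdict(list)
    for city, level in pairs:
        groups[city].append(level)
    return dict(groups)

def reduce_city(city, levels):
    """Return the highest level seen for the city across all months."""
    return (city, max(levels))

mapped = [pair for month, recs in monthly_files.items()
          for pair in map_month(month, recs)]
grouped = shuffle(mapped)
result = dict(reduce_city(city, levels) for city, levels in grouped.items())
print(result)
```

In a real cluster each call to map_month and reduce_city could run on a different node; here they run sequentially in one process.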

27 HDFS
Hadoop Distributed File System (HDFS) supports rapid transfer of data between nodes
A custom high-level file system for datasets in Hadoop
Similar to local file systems (NTFS, FAT, EXT, etc.)
Data is broken up into blocks
  Faster transfer between nodes
Highly fault tolerant
  Distributed data, redundancy, copies
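The ideas of blocks and redundant copies can be sketched with a toy model. This is not the HDFS API and not its real placement policy: the round-robin placement, the tiny block size, and the node names are all assumptions made for illustration.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

def survives_failure(placement, failed_node):
    """True if every block still has at least one copy after one node fails."""
    return all(any(n != failed_node for n in replicas)
               for replicas in placement.values())

blocks = split_into_blocks(b"abcdefghij", 4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], 3)
print(blocks)
print(placement)
print(survives_failure(placement, "n2"))
```

With several copies of every block on different nodes, losing one node does not lose any data in this model.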

28 HDFS

29 Hadoop Nodes
Nodes/machines in Hadoop can be
  Data nodes only (running HDFS and providing the data)
  Client nodes only (executing Map or Reduce or both)
  Hybrid nodes (storing data and doing the processing)
In addition, Hadoop has name nodes
  Keep track of where data is stored, that is, on which data nodes
  Are a single point of failure in Hadoop

30 Centrality vs Distribution
Function | Architecture | Comment
Split | Centralized | Part of HDFS and might only have to be executed once
Map | Distributed |
Shuffle | Centralized / Distributed | Depending on the implementation, shuffle can be distributed, but is more commonly centralized
Reduce | Distributed |

31 MapReduce vs RDBMS
Pavlo et al., 2009, A comparison of approaches to large-scale data analysis, Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp
Vertica: Column-oriented massively parallel processing database with machine learning features
DBMS-X: Row-oriented relational database management system
Hadoop: MapReduce system

32 MapReduce vs RDBMS

37 MapReduce with Machine Learning
MapReduce tasks must have an acyclic data flow
  Map function must be stateless
  Reduce function must be stateless
  Why is statelessness required?
Machine learning is difficult to combine with MapReduce
  ML often requires continuous querying of datasets
  ML often requires keeping state
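One possible way to explore the statelessness question is to compare a stateless mapper with a deliberately stateful one. This sketch only illustrates one reason (results should not depend on the order in which splits are processed); the mapper names and data are made up.

```python
def stateless_map(record):
    """Output depends only on the input record."""
    city, level = record
    return (city, level)

class StatefulMapper:
    """Deliberately bad mapper: output depends on how many records came before."""
    def __init__(self):
        self.seen = 0

    def __call__(self, record):
        self.seen += 1
        city, level = record
        return (city, level + self.seen)

def run(make_mapper, splits):
    """Process the splits in the given order and return the sorted output."""
    mapper = make_mapper()
    out = []
    for split in splits:
        out.extend(mapper(r) for r in split)
    return sorted(out)

a = [("capetown", 5)]
b = [("durban", 1), ("pretoria", 4)]

stateless_same = run(lambda: stateless_map, [a, b]) == run(lambda: stateless_map, [b, a])
stateful_same = run(StatefulMapper, [a, b]) == run(StatefulMapper, [b, a])
print(stateless_same, stateful_same)
```

The stateless mapper gives the same result regardless of scheduling, so splits can be processed on any node, in any order, or re-executed after a failure. The stateful mapper does not have that property.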

38 Why Use MapReduce?
If MapReduce is slower than RDBMSs for many tasks and does not fit well with machine learning, why use MapReduce at all?
