CIS 601 Graduate Seminar Presentation. Introduction to MapReduce --Mechanism and Application. Presented by: Suhua Wei, Yong Yu

Papers
MapReduce: Simplified Data Processing on Large Clusters [1]
--Jeffrey Dean and Sanjay Ghemawat
--Introduction, Model, Implementation, Performance
Hive: A Warehousing Solution Over a Map-Reduce Framework [2]
--Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy
--Introduction, Hive Database, Hive Architecture, Demonstration Description
[1] https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
[2] http://202.118.11.61/papers/db%20in%20the%20cloud/hive.pdf

Introduction
Background: Google, the five years leading up to 2004
Hundreds of special-purpose computations:
--Process large amounts of raw data: crawled documents, web request logs, etc.
--The computations have to be distributed across hundreds of machines
--Most computations are conceptually straightforward, but the input data is large (3,288 TB / 29,423 jobs, about 112 GB per job)
Issues:
--How to parallelize the computation, distribute the data, and handle failures

Solution: MapReduce
A new abstraction:
--Expresses the simple computations while hiding the messy details of parallelization, fault tolerance, data distribution and load balancing in a library
A functional model:
--Inspired by the map and reduce primitives present in Lisp
--User-specified map and reduce operations make it easy to parallelize large computations, and re-execution serves as the primary mechanism for fault tolerance
Contributions:
--A simple and powerful interface for large clusters of commodity PCs: automatic parallelization and distribution of large-scale computations

Programming Model
Computation:
--Takes a set of input key/value pairs
--Produces a set of output key/value pairs
Map (written by the user): takes an input pair and produces a set of intermediate key/value pairs
MapReduce library: groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function
Reduce (written by the user): accepts an intermediate key I and a set of values for that key
--Merges these values together to form a possibly smaller set of values

Example
Counting the number of occurrences of each word in a large collection of documents
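The word-count example can be sketched in a few lines of Python. This is a minimal single-process illustration, not the Google library: `map_fn`, `reduce_fn` and `run_word_count` are hypothetical names, and the grouping loop stands in for the work the MapReduce library does between the two phases.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """Map: emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)

def run_word_count(documents):
    # Group intermediate values by key; in MapReduce the library does this.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            groups[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in groups.items())

docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
word_counts = run_word_count(docs)
```

Only `map_fn` and `reduce_fn` would be written by the user; everything in `run_word_count` is what the framework is meant to hide.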

Examples of MapReduce Computations
Distributed Grep
--Map: emits a line if it matches a supplied pattern
--Reduce: identity function, copies the intermediate data to the output
Count of URL Access Frequency
--Map: processes logs of web page requests and outputs (URL, 1)
--Reduce: adds together all values for the same URL and emits a (URL, total count) pair
Reverse Web-Link Graph
Term-Vector per Host
Inverted Index
Distributed Sort
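The first two examples can be written as map/reduce pairs along the following lines. This is a sketch under stated assumptions: `run_local` is a hypothetical single-process driver, and the log format (`<client> <url>` per line) is invented for illustration.

```python
import re
from collections import defaultdict

def run_local(map_fn, reduce_fn, records):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output

def make_grep_map(pattern):
    """Distributed grep: emit a line only if it matches the pattern."""
    regex = re.compile(pattern)
    def grep_map(line_no, line):
        if regex.search(line):
            yield line_no, line
    return grep_map

def identity_reduce(key, values):
    """Copy the intermediate data to the output unchanged."""
    for value in values:
        yield key, value

def url_map(line_no, log_line):
    """URL access frequency: emit (url, 1) per request (assumed log format)."""
    _client, url = log_line.split()
    yield url, 1

def url_reduce(url, counts):
    """Add all values for one URL and emit (url, total count)."""
    yield url, sum(counts)

lines = [(0, "error: disk full"), (1, "ok"), (2, "error: timeout")]
grep_out = run_local(make_grep_map("error"), identity_reduce, lines)

logs = [(0, "10.0.0.1 /index"), (1, "10.0.0.2 /about"), (2, "10.0.0.1 /index")]
url_out = run_local(url_map, url_reduce, logs)
```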

Implementation
The right implementation depends on the environment:
--A small shared-memory machine; a large NUMA multi-processor; a large collection of networked machines
In Google's environment:
--x86 processors; Linux; 2-4 GB of memory per machine
--Commodity networking hardware
--A cluster consists of hundreds or thousands of machines
--Storage: inexpensive IDE disks
--Users submit jobs to a scheduling system

Execution Overview

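Sequence of Actions
1, The input files are split into M pieces of 16 to 64 MB each
2, The master assigns map and reduce tasks to idle workers
3, A map worker reads the contents of its input split and passes each key/value pair to the map function; the intermediate pairs are buffered in memory
4, The buffered pairs are periodically written to local disk, partitioned into R regions
5, The master passes the locations of the buffered data to the reduce workers, which read it remotely, sort all intermediate data by key, and group the values for each key
6, The reduce worker passes each key and its values to the reduce function
7, When all map and reduce tasks are complete, the master wakes up the user program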
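The sequence of actions can be imitated in one process to make the data flow concrete. The sketch below is an assumption-laden toy: workers are plain function calls, "local disk" is a dictionary, and `crc32` stands in for whatever hash the partitioning function uses; none of the names come from the paper.

```python
from collections import defaultdict
from itertools import groupby
from zlib import crc32

def split_input(records, m):
    """Step 1: divide the input into at most M roughly equal splits."""
    size = max(1, -(-len(records) // m))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def partition(key, r):
    """Assign an intermediate key to one of R regions (hash(key) mod R)."""
    return crc32(key.encode("utf-8")) % r

def run_map_task(split, map_fn, r):
    """Steps 3-4: apply map and store the pairs in R local regions."""
    regions = defaultdict(list)
    for key, value in split:
        for k, v in map_fn(key, value):
            regions[partition(k, r)].append((k, v))
    return regions

def run_reduce_task(region_id, map_outputs, reduce_fn):
    """Steps 5-6: fetch one region from every map task, sort, group, reduce."""
    pairs = [p for out in map_outputs for p in out.get(region_id, [])]
    pairs.sort(key=lambda kv: kv[0])
    results = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        results.append(reduce_fn(key, [v for _, v in group]))
    return results

def map_reduce(records, map_fn, reduce_fn, m=3, r=2):
    map_outputs = [run_map_task(s, map_fn, r) for s in split_input(records, m)]
    # Step 7: all tasks are done; hand the R outputs back to the caller.
    return [run_reduce_task(i, map_outputs, reduce_fn) for i in range(r)]

records = [("d1", "a b a"), ("d2", "b c"), ("d3", "a c c")]
outputs = map_reduce(
    records,
    map_fn=lambda name, text: ((w, 1) for w in text.split()),
    reduce_fn=lambda word, counts: (word, sum(counts)),
)
merged = dict(pair for part in outputs for pair in part)
```

Because every key hashes to exactly one region, each key is reduced once, and the R outputs can be merged or fed directly into another job.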

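Fault Tolerance
Machine failures must be tolerated gracefully
--Very large amounts of data and hundreds or thousands of machines
Worker failure:
--The master pings every worker periodically; a worker that does not respond within a certain time is marked as failed
--Any map or reduce task in progress on a failed worker is reset to idle and becomes eligible for rescheduling; completed map tasks on that worker are also reset, because their output sits on its local disk
Master failure:
--The master can write periodic checkpoints of its data structures
--If the master task dies, a new copy can be started from the last checkpoint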
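The worker-failure rule can be sketched as a small piece of master bookkeeping. This is an illustration only: the `Master` class, its method names and the timeout value are hypothetical. Following the MapReduce paper, completed map tasks on a failed worker are reset as well (their output lives on that machine's local disk), while completed reduce tasks are left alone (their output is in the global file system).

```python
IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"

class Master:
    """Toy master that tracks task state and worker liveness."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # worker id -> time of last ping response
        self.tasks = {}      # task id -> {"kind", "state", "worker"}

    def add_task(self, task_id, kind):
        self.tasks[task_id] = {"kind": kind, "state": IDLE, "worker": None}

    def assign(self, task_id, worker, now):
        self.tasks[task_id].update(state=IN_PROGRESS, worker=worker)
        self.last_seen.setdefault(worker, now)

    def complete(self, task_id):
        self.tasks[task_id]["state"] = COMPLETED

    def record_ping(self, worker, now):
        self.last_seen[worker] = now

    def check_workers(self, now):
        """Mark silent workers as failed and reset their tasks to idle."""
        failed = {w for w, t in self.last_seen.items()
                  if now - t > self.timeout_s}
        for task in self.tasks.values():
            if task["worker"] not in failed:
                continue
            # In-progress tasks are always redone; completed map tasks are
            # redone too because their output was on the failed machine.
            if task["state"] == IN_PROGRESS or task["kind"] == "map":
                task.update(state=IDLE, worker=None)
        for worker in failed:
            del self.last_seen[worker]
        return failed

master = Master(timeout_s=10)
for task_id, kind in [("map-0", "map"), ("map-1", "map"),
                      ("red-0", "reduce"), ("red-1", "reduce")]:
    master.add_task(task_id, kind)
master.assign("map-0", "w1", now=0); master.complete("map-0")
master.assign("red-0", "w1", now=0); master.complete("red-0")
master.assign("red-1", "w1", now=0)
master.assign("map-1", "w2", now=0)
master.record_ping("w2", now=15)
failed = master.check_workers(now=20)
```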

Performance
Cluster: 1800 machines, each with two 2 GHz Intel Xeon processors, 4 GB of memory, two 160 GB IDE disks, and a gigabit Ethernet link
Grep: scans 10^10 100-byte records (about 1 TB) for a three-character pattern (present in 92,337 records)
Sort:

Experience
MapReduce has been used across a wide range of domains:
--Large-scale machine learning problems
--Clustering problems for the Google News and Froogle products
--Extraction of data used to produce reports of popular queries
--Extraction of properties of web pages for new experiments and products
--Large-scale graph computations

Hive: A Warehousing Solution Over a Map-Reduce Framework
By Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy
Facebook Data Infrastructure Team
Presented by Suhua Wei, Yong Yu

Introduction
The map-reduce programming model is very low level and requires developers to write custom programs, which are hard to maintain and reuse
Hive is built on top of Hadoop
Hive supports queries expressed in HiveQL, a SQL-like declarative language

Hive Database: Data Model
Tables
--Analogous to tables in relational databases
--Each table has a corresponding HDFS directory
--Hive provides built-in serialization formats which exploit compression and lazy de-serialization
Partitions
--Each table can have one or more partitions
--Example: table T is stored in the directory /wh/T. If T is partitioned on the columns ds and ctry, the data with ds = 20090101 and ctry = US is stored in /wh/T/ds=20090101/ctry=US
Buckets
--Data in each partition may in turn be divided into buckets based on the hash of a column in the table
--Each bucket is stored as a file in the partition directory
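The mapping from partition values and bucket hashes to storage locations can be sketched as follows. The helper names, the `crc32` hash and the `bucket_NNNNN` file name are assumptions made for illustration; the slides do not say which hash function or file naming Hive uses.

```python
import posixpath
from zlib import crc32

def partition_dir(table_dir, partition_values):
    """Build the directory for one partition: one col=value level per column."""
    parts = [f"{col}={val}" for col, val in partition_values]
    return posixpath.join(table_dir, *parts)

def bucket_for(value, num_buckets):
    """Pick a bucket from the hash of a column value (crc32 is a stand-in)."""
    return crc32(str(value).encode("utf-8")) % num_buckets

def bucket_file(table_dir, partition_values, value, num_buckets):
    """Path of the bucket file inside the partition directory (assumed name)."""
    name = f"bucket_{bucket_for(value, num_buckets):05d}"
    return posixpath.join(partition_dir(table_dir, partition_values), name)

spec = [("ds", "20090101"), ("ctry", "US")]
directory = partition_dir("/wh/T", spec)
path = bucket_file("/wh/T", spec, value="user42", num_buckets=32)
```

Here `directory` reproduces the partition path from the slide, and `path` is one file beneath it.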

Hive Database: Query Language
HiveQL
--Supports select, project, join, aggregate, union all, and sub-queries in the from clause
--Supports data definition (DDL) statements and data manipulation (DML) statements such as load and insert (but not update or delete)
--Supports user-defined column transformation (UDF) and aggregation (UDAF) functions implemented in Java
--Users can embed custom map-reduce scripts written in any language using a simple row-based streaming interface
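A custom script for the row-based streaming interface might look like the sketch below. It assumes rows arrive one per line with tab-separated columns and leave in the same form; the two-column input (`user_id`, `status`) and the function name `stream_rows` are hypothetical.

```python
import io

def stream_rows(infile, outfile):
    """Read tab-separated rows and emit one (word, 1) row per word."""
    for line in infile:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows instead of failing the job
        _user_id, status = fields
        for word in status.split():
            outfile.write(f"{word.lower()}\t1\n")

# Stand-in for the rows the query engine would pipe into the script.
demo_in = io.StringIO("u1\tHello world\nu2\thello again\n")
demo_out = io.StringIO()
stream_rows(demo_in, demo_out)
```

Run as a standalone script, the same function would be called with `sys.stdin` and `sys.stdout` in place of the in-memory streams.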

Hive Database: Running Example (StatusMeme)
When Facebook users update their status, the updates are logged into flat files in an NFS directory, /logs/status_updates
Goal: compute daily statistics on the frequency of status updates based on gender and school

Hive Architecture
External interfaces: user interfaces such as the command line (CLI) and a web UI
Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages
The Metastore is the system catalog; all other components of Hive interact with it
The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution, and keeps track of its statistics
Figure 1: Hive Architecture

Hive Architecture
Figure 2: Query plan with 3 map-reduce jobs for multi-table insert query (drawn bottom to top)

Hive Architecture: Metastore
The system catalog, which contains metadata about the tables stored in Hive
This metadata is specified during table creation and reused every time the table is referenced in HiveQL
Contains the following objects:
--Database: the namespace for tables
--Table: the metadata for a table contains the list of columns and their types, owners, storage and SerDe information
--Partition: each partition can have its own columns, SerDe and storage information

Hive Architecture: Compiler
The compiler converts a string (DDL, DML or query statement) into a plan
--The parser transforms the query string into a parse tree representation
--The semantic analyzer transforms the parse tree into a block-based internal query representation
--The logical plan generator converts the internal query representation into a logical plan
The optimizer performs multiple passes over the logical plan and rewrites it in several ways:
--Combines multiple joins which share the join key into a single multi-way join, and hence a single map-reduce job
--Adds repartition operators
--Prunes columns early and pushes predicates closer to the table scan operators

Hive Architecture: Compiler (continued)
Further optimizer rewrites of the logical plan:
--In the case of partitioned tables, prunes partitions that are not needed by the query
--In the case of sampling queries, prunes buckets that are not needed
Users can also provide hints to the optimizer to:
--Add partial aggregation operators to handle large-cardinality grouped aggregations
--Add repartition operators to handle skew in grouped aggregations
--Perform joins in the map phase instead of the reduce phase
The physical plan generator converts the logical plan into a physical plan, consisting of a directed acyclic graph (DAG) of map-reduce jobs
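Partition pruning, one of the rewrites listed above, can be illustrated with a toy example: given the partitions recorded for a table and a predicate on the partition columns, only the matching directories need to be scanned. The data structures and the `prune_partitions` name are hypothetical and are not Hive's internal representation.

```python
def prune_partitions(partitions, predicate):
    """Keep only partitions whose column values satisfy the predicate."""
    return [p for p in partitions if predicate(p["spec"])]

# Hypothetical catalog entries for a table partitioned on ds and ctry.
partitions = [
    {"spec": {"ds": "20090101", "ctry": "US"},
     "dir": "/wh/T/ds=20090101/ctry=US"},
    {"spec": {"ds": "20090101", "ctry": "CA"},
     "dir": "/wh/T/ds=20090101/ctry=CA"},
    {"spec": {"ds": "20090102", "ctry": "US"},
     "dir": "/wh/T/ds=20090102/ctry=US"},
]

# Equivalent of a filter "ds = '20090101' AND ctry = 'US'" on the table.
selected = prune_partitions(
    partitions,
    lambda spec: spec["ds"] == "20090101" and spec["ctry"] == "US",
)
dirs_to_scan = [p["dir"] for p in selected]
```

The saving comes from never opening the other directories: the filter is resolved against catalog metadata before any map task reads data.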

Summary
Hive is a first step in building an open-source warehouse over a web-scale map-reduce data processing system (Hadoop)
As of 2009, the team was working towards making HiveQL subsume SQL syntax
Hive has a naïve rule-based optimizer with a small number of simple rules; the plan is to build a cost-based optimizer and adaptive optimization techniques
Exploring columnar storage and more intelligent data placement to improve scan performance
Enhancing the drivers for integration with commercial BI tools
Exploring methods for multi-query optimization