Timeline Dec 2004: Dean/Ghemawat (Google) MapReduce paper 2005: Doug Cutting and Mike Cafarella (Yahoo) create Hadoop, at first only to extend Nutch (

Beverley Welch
6 years ago
1 HADOOP Lecture 5

2 Timeline
- Dec 2004: Dean/Ghemawat (Google) MapReduce paper
- 2005: Doug Cutting and Mike Cafarella (Yahoo) create Hadoop, at first only to extend Nutch (the name is derived from Doug's son's toy elephant)
- 2006: Yahoo runs Hadoop on 5-20 nodes
- March 2008: Cloudera founded
- July 2008: Hadoop wins the TeraByte sort benchmark (the first time a Java program won this competition)
- April 2009: Amazon introduces Elastic MapReduce as a service on S3/EC2

3 Timeline
- June 2011: Hortonworks founded
- 27 Dec 2011: Apache Hadoop release
- June 2012: Facebook claims the biggest Hadoop cluster, totalling more than 100 PetaBytes in HDFS
- 2013: Yahoo runs Hadoop on 42,000 nodes, computing about 500,000 MapReduce jobs per day
- 15 Oct 2013: Apache Hadoop release (YARN)
- Feb 2014: Apache Spark adopted as Apache project

4 Hadoop
- Scalable: a cluster can be expanded by adding new servers or resources without having to move, reformat, or change the dependent analytic workflows or applications.
- Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analysis.

5 Hadoop
- Cost Effective: brings massively parallel computing to commodity servers.
- Fault Tolerant: when you lose a node, the system redirects work to another location of the data and continues processing.

6 Who uses Hadoop?
- Adobe, Facebook, Google, Cisco, Yahoo!, ...
- Do you know any other companies?

7 Commodity Hardware
- Typically in a 2-level architecture (aggregation switch, rack switches)
- Nodes are commodity servers
- 30-40 nodes/rack
- Uplink from rack is up to 8 gigabit
- Rack-internal is 1 gigabit

8 Framework Core
The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common: contains libraries and utilities needed by other Hadoop modules;
- Hadoop Distributed File System (HDFS): a distributed filesystem that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
- Hadoop MapReduce: a programming model for large scale data processing;
- Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.

9 Framework Wider Components
Besides the core framework, the Apache Hadoop ecosystem includes (but is not limited to) the following modules:
- Ambari, Zookeeper - managing and monitoring
- HBase, Cassandra - database
- Hive, Pig - data warehouse and query language
- Mahout - machine learning
- Chukwa, Avro, Oozie, Giraph, ...

10 Hadoop Ecosystem Hadoop is supplemented by an ecosystem of Apache open-source projects that extend the value of Hadoop and improve its usability

11 Hadoop Ecosystem source:

12 HDFS - Data Model
- Data is organized into files and directories
- Files are divided into uniform sized blocks (128 MB) and distributed across cluster nodes
- Blocks are replicated to handle hardware failure
- Checksums of data are kept for corruption detection and recovery
- Block placement is exposed so that computation can be migrated to the data
- Supports large streaming reads and small random reads
- Facility for multiple clients to append to a file
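The block model above can be illustrated with a small calculation. This is only a sketch: `split_into_blocks` and `raw_storage` are hypothetical helper names, the replication factor of 3 is an assumption, and the last block of a file is allowed to be shorter than the others.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size quoted on the slide
REPLICATION = 3                 # assumption: a typical replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be shorter
        blocks.append((offset, length))
        offset += length
    return blocks


def raw_storage(file_size, replication=REPLICATION):
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication


size = 300 * 1024 * 1024  # a hypothetical 300 MB file
for offset, length in split_into_blocks(size):
    print(offset, length)
print(raw_storage(size))
```

A 300 MB file therefore occupies three blocks, and each of them would be stored several times on different nodes.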

13 HDFS Architecture

14 NameNode
- Manages File System Namespace
  - Maps a file name to a set of blocks
  - Maps a block to the DataNodes where it resides
- Cluster Configuration Management
- Replication Engine for Blocks

15 NameNode Metadata
- Metadata in Memory
  - The entire metadata is in main memory
  - No demand paging of metadata
- Types of metadata
  - List of files
  - List of Blocks for each file
  - List of DataNodes for each block
  - File attributes, e.g. creation time, replication factor
- A Transaction Log
  - Records file creations, file deletions etc.
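A minimal in-memory sketch of these metadata structures follows. The class `NameNodeMetadata` and its method names are hypothetical and do not correspond to the real NameNode classes. Block locations are kept out of the log here on the assumption that they are learned from DataNode block reports.

```python
class NameNodeMetadata:
    """Toy model: all metadata lives in plain dictionaries in main memory."""

    def __init__(self):
        self.file_blocks = {}      # file name -> list of block ids
        self.block_locations = {}  # block id -> set of DataNode ids
        self.attributes = {}       # file name -> attributes (replication, ...)
        self.edit_log = []         # transaction log of namespace changes

    def create_file(self, name, replication=3):
        self.file_blocks[name] = []
        self.attributes[name] = {"replication": replication}
        self.edit_log.append(("create", name, replication))

    def add_block(self, name, block_id, datanodes):
        self.file_blocks[name].append(block_id)
        self.block_locations[block_id] = set(datanodes)
        self.edit_log.append(("add_block", name, block_id))

    def delete_file(self, name):
        for block_id in self.file_blocks.pop(name):
            self.block_locations.pop(block_id, None)
        self.attributes.pop(name)
        self.edit_log.append(("delete", name))

    def locate(self, name):
        """Resolve a file name to its blocks and the DataNodes holding them."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[name]]


meta = NameNodeMetadata()
meta.create_file("/logs/a.txt")
meta.add_block("/logs/a.txt", "blk_1", ["dn1", "dn2", "dn3"])
print(meta.locate("/logs/a.txt"))
```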

16 DataNode
- A Block Server
  - Stores data in the local file system (e.g. ext3)
  - Stores metadata of a block (e.g. CRC)
  - Serves data and metadata to Clients
- Block Report
  - Periodically sends a report of all existing blocks to the NameNode
- Facilitates Pipelining of Data
  - Forwards data to other specified DataNodes

17 Block Placement
- Default Strategy
  - One replica on local node
  - Second replica on same rack
  - Third replica on remote rack
  - Additional replicas are randomly placed
- Clients read from nearest replicas
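The strategy as listed on this slide can be sketched as follows. `place_replicas` and the topology dictionary are hypothetical, the code assumes at least two nodes in the writer's rack and at least two racks, and actual Hadoop releases may apply a different rack policy.

```python
import random


def place_replicas(writer, topology, replication=3, rng=random):
    """Pick DataNodes for one block.

    topology maps a rack name to the list of nodes in that rack;
    writer is the node on which the client runs.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    chosen = [writer]  # first replica: the local node
    if replication >= 2:
        same_rack = [n for n in topology[local_rack] if n not in chosen]
        chosen.append(rng.choice(same_rack))  # second replica: same rack
    if replication >= 3:
        remote = [n for rack, nodes in topology.items()
                  if rack != local_rack for n in nodes]
        chosen.append(rng.choice(remote))  # third replica: a remote rack
    # Any additional replicas are placed randomly on the remaining nodes.
    others = [n for nodes in topology.values() for n in nodes if n not in chosen]
    rng.shuffle(others)
    chosen.extend(others[:max(0, replication - len(chosen))])
    return chosen[:replication]


topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(place_replicas("n1", topology))
```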

18 Heartbeats
- DataNodes send heartbeats to the NameNode
  - Once every 3 seconds
- NameNode uses heartbeats to detect DataNode failure
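Failure detection from heartbeats can be sketched as a timeout check. `HeartbeatMonitor` is a hypothetical name, and the timeout of ten missed heartbeats is an assumption chosen for illustration, since the slide only states the 3 second interval.

```python
HEARTBEAT_INTERVAL = 3.0               # seconds, from the slide
DEAD_AFTER = 10 * HEARTBEAT_INTERVAL   # assumption: timeout used for this sketch


class HeartbeatMonitor:
    """Tracks the last heartbeat time of each DataNode."""

    def __init__(self, dead_after=DEAD_AFTER):
        self.dead_after = dead_after
        self.last_seen = {}  # DataNode id -> time of last heartbeat

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """DataNodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.dead_after)


monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.heartbeat("dn1", now=30.0)   # dn2 stays silent
print(monitor.dead_nodes(now=31.0))
```

Times are passed in explicitly so the sketch stays deterministic; a real monitor would read a clock.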

19 Replication Engine
- NameNode detects DataNode failures
  - Chooses new DataNodes for new replicas
  - Balances disk usage
  - Balances communication traffic to DataNodes

20 Data Correctness
- Use Checksums to validate data
  - Use CRC32
- File Creation
  - Client computes checksum per 512 bytes
  - DataNode stores the checksum
- File access
  - Client retrieves the data and checksum from DataNode
  - If Validation fails, Client tries other replicas
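The checksum scheme can be sketched with the standard `zlib.crc32` function. The helper names `compute_checksums`, `verify` and `read_with_fallback` are hypothetical, and the replicas are plain byte strings rather than data fetched from DataNodes.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # from the slide: one checksum per 512 bytes


def compute_checksums(data, chunk=BYTES_PER_CHECKSUM):
    """CRC32 of every 512-byte chunk, as the client would compute on write."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def verify(data, checksums, chunk=BYTES_PER_CHECKSUM):
    """True if the data still matches the checksums stored at creation time."""
    return compute_checksums(data, chunk) == checksums


def read_with_fallback(replicas, checksums):
    """Return the first replica that validates; try the others on failure."""
    for data in replicas:
        if verify(data, checksums):
            return data
    raise IOError("all replicas failed checksum validation")


original = bytes(range(256)) * 5          # 1280 bytes -> 3 chunks
stored = compute_checksums(original)
corrupted = bytearray(original)
corrupted[700] ^= 0xFF                    # flip one byte in the second chunk
print(read_with_fallback([bytes(corrupted), original], stored) == original)
```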

21 NameNode Failure
- A single point of failure
- Transaction Log stored in multiple directories
  - A directory on the local file system
  - A directory on a remote file system (NFS)
- Need to develop a real HA solution

22 Data Pipelining
- Client retrieves a list of DataNodes on which to place replicas of a block
- Client writes block to the first DataNode
- The first DataNode forwards the data to the next node in the Pipeline
- When all replicas are written, the Client moves on to write the next block in file
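The write pipeline can be sketched as a chain of forwards followed by an acknowledgement. `write_block`, `write_file` and the `stores` dictionary are hypothetical, and the sketch ignores failures and assumes every pipeline contains at least one DataNode.

```python
def write_block(block, pipeline, stores):
    """Store block on every DataNode in pipeline order; return the final ack."""
    def receive(index):
        stores.setdefault(pipeline[index], []).append(block)
        if index + 1 < len(pipeline):
            return receive(index + 1)  # forward to the next DataNode
        return True                    # last DataNode acknowledges
    return receive(0)


def write_file(data, block_size, choose_pipeline, stores):
    """Write data block by block; the next block starts only after the ack."""
    written = 0
    for index, offset in enumerate(range(0, len(data), block_size)):
        block = data[offset:offset + block_size]
        if not write_block(block, choose_pipeline(index), stores):
            raise IOError("pipeline write failed")
        written += 1
    return written


stores = {}  # DataNode id -> list of blocks it holds
count = write_file(b"abcdefgh", 3, lambda i: ["dn1", "dn2", "dn3"], stores)
print(count, stores["dn3"])
```

Here `choose_pipeline` stands in for the list of DataNodes that the client would obtain for each block.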

23 Secondary NameNode
- Copies FsImage and Transaction Log from the NameNode to a temporary directory
- Merges FsImage and Transaction Log into a new FsImage in the temporary directory
- Uploads the new FsImage to the NameNode
- Transaction Log on the NameNode is purged
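The checkpoint steps can be sketched on toy data structures. The dictionary layout of `namenode`, the two operation names and the functions `apply_edits` and `checkpoint` are assumptions made for this illustration, not the real on-disk formats.

```python
def apply_edits(fsimage, edits):
    """Replay a transaction log on top of an image and return the new image."""
    image = dict(fsimage)
    for op, path, *args in edits:
        if op == "create":
            image[path] = {"replication": args[0]}
        elif op == "delete":
            image.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return image


def checkpoint(namenode):
    """Copy image and log, merge them, upload the result, purge the log."""
    image_copy = dict(namenode["fsimage"])   # copy to a temporary location
    edits_copy = list(namenode["edits"])
    new_image = apply_edits(image_copy, edits_copy)
    namenode["fsimage"] = new_image          # upload the merged image
    del namenode["edits"][:len(edits_copy)]  # purge only the merged entries
    return new_image


namenode = {
    "fsimage": {"/a": {"replication": 3}},
    "edits": [("create", "/b", 2), ("delete", "/a")],
}
print(checkpoint(namenode), namenode["edits"])
```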

24 MapReduce
- MapReduce is a programming model for efficient distributed computing
- It works like a Unix pipeline:
    cat input | grep | sort | uniq -c | cat > output
    Input | Map | Shuffle & Sort | Reduce | Output
- Efficiency from
  - Streaming through data, reducing seeks
  - Pipelining
- A good fit for a lot of applications
  - Log processing
  - Web index building
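The Map, Shuffle & Sort and Reduce stages can be sketched in a single process with a word count. This is only an illustration of the programming model: the function names are hypothetical and nothing here is distributed.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the reducer to each (key, values) group."""
    return [reducer(key, values) for key, values in groups]


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle_and_sort(map_phase(records, mapper)), reducer)


def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    return word, sum(counts)


print(run_job(["a b a", "b c"], word_mapper, count_reducer))
```

The mapper plays the role of `grep`, the shuffle step the role of `sort`, and the reducer the role of `uniq -c` in the pipeline analogy.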

25 MapReduce [figure: Task 1, Task 2, Task 3; Aggregated Result; Output data]

26 MapReduce - Dataflow

27 MapReduce - Features
- Fine grained Map and Reduce tasks
  - Improved load balancing
  - Faster recovery from failed tasks
- Automatic re-execution on failure
  - In a large cluster, some nodes are always slow or flaky
  - Framework re-executes failed tasks
- Locality optimizations
  - With large data, bandwidth to data is a problem
  - Map-Reduce + HDFS is a very effective solution
  - Map-Reduce queries HDFS for locations of input data
  - Map tasks are scheduled close to the inputs when possible
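The locality optimization can be sketched as a preference order when choosing a node for a map task. `pick_node` and the three locality labels are hypothetical, and a real scheduler weighs more factors than this.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Choose a node for a map task, preferring nodes close to the input block.

    replica_nodes: nodes holding a replica of the task's input block
    free_nodes:    nodes that currently have spare capacity, in offer order
    rack_of:       node -> rack name
    """
    for node in free_nodes:                 # 1. a node that holds the data
        if node in replica_nodes:
            return node, "node-local"
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_nodes:                 # 2. a node in the same rack
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    if free_nodes:                          # 3. anywhere else
        return free_nodes[0], "off-rack"
    return None, "none"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(pick_node({"n1"}, ["n3", "n2"], rack_of))
```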

28 Pig
- Started at Yahoo! Research
- Now runs about 30% of Yahoo!'s jobs
- Features
  - Expresses sequences of MapReduce jobs
  - Data model: nested bags of items
  - Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
  - Easy to plug in Java functions

29 An Example
Suppose you have user data in one file, website visits in a second one, and you need to find the top n most visited pages by users (potentially students) in a given age range.
- Load Users
- Filter by age
- Load Pages
- Join on name
- Group on url
- Count clicks
- Order by clicks
- Take top n

30 Pig Latin
    Users = load 'users' as (name, age);
    Filtered = filter Users by age >= 17 and age <= 24;
    Pages = load 'pages' as (user, url);
    Joined = join Filtered by name, Pages by user;
    Grouped = group Joined by url;
    Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
    Sorted = order Summed by clicks desc;
    TopN = limit Sorted n;
    store TopN into 'topnsites';
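For comparison, the same dataflow can be sketched in plain Python on in-memory lists. `top_sites` is a hypothetical function, user names are assumed to be unique, and ties in the click count are broken alphabetically, which the Pig script does not specify.

```python
from collections import Counter


def top_sites(users, pages, n, min_age=17, max_age=24):
    """users: (name, age) pairs; pages: (user, url) pairs."""
    # Filter by age
    filtered = {name for name, age in users if min_age <= age <= max_age}
    # Join on name
    joined = [(user, url) for user, url in pages if user in filtered]
    # Group on url and count clicks
    clicks = Counter(url for _, url in joined)
    # Order by clicks (descending) and take the top n
    return sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


users = [("ann", 18), ("bob", 30), ("cat", 24)]
pages = [("ann", "a.com"), ("bob", "a.com"), ("cat", "b.com"),
         ("ann", "b.com"), ("cat", "b.com")]
print(top_sites(users, pages, n=1))
```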

31 Translation
    Load Users        Users = load
    Filter by age     Filtered = filter
    Load Pages        Pages = load
    Join on name      Joined = join
    Group on url      Grouped = group
    Count clicks      Summed = count()
    Order by clicks   Sorted = order
    Take top n        TopN = limit

32 Translation [figure: the same steps and Pig Latin statements as on the previous slide, grouped into three MapReduce jobs: Job 1, Job 2, Job 3]

33 HBase
- Modeled on Google's Bigtable
- Row/column store
- Billions of rows/millions of columns
- Column-oriented - nulls are free
- Untyped - stores byte[]
- Already Discussed: persistent, distributed, sorted, multidimensional, sparse

34 HBase - Data Model
- Column families - physically, all column family members are stored together on the filesystem
- Column qualifiers - added to a column family to provide the index for a given piece of data
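The family:qualifier model can be sketched as nested dictionaries holding untyped bytes. `TinyTable` is a hypothetical class, not the HBase client API, and it leaves out cell versions; the example values are borrowed from the code slide that follows.

```python
class TinyTable:
    """Toy sparse table: row key -> column family -> qualifier -> bytes."""

    def __init__(self, families):
        self.families = set(families)  # families are fixed up front
        self.rows = {}

    def put(self, row, column, value):
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers can be added freely; absent cells take no space.
        self.rows.setdefault(row, {}).setdefault(family, {})[qualifier] = value

    def get(self, row, column):
        family, qualifier = column.split(":", 1)
        return self.rows.get(row, {}).get(family, {}).get(qualifier)

    def scan(self, start=None, stop=None):
        """Yield rows in sorted key order within [start, stop)."""
        for key in sorted(self.rows):
            if (start is None or key >= start) and (stop is None or key < stop):
                yield key, self.rows[key]


table = TinyTable(["animal"])
table.put("enclosure1", "animal:type", "lion".encode("utf-8"))
table.put("enclosure1", "animal:size", "big".encode("utf-8"))
table.put("enclosure1", "animal:type", "zebra".encode("utf-8"))
print(table.get("enclosure1", "animal:type"))
```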

35 HBase - Data Storage Column family anchor: Column family contents:

36 HBase - Code
    HTable table = ...
    Text row = new Text("enclosure1");
    Text col1 = new Text("animal:type");
    Text col2 = new Text("animal:size");
    BatchUpdate update = new BatchUpdate(row);
    update.put(col1, "lion".getBytes("UTF-8"));
    update.put(col2, "big".getBytes("UTF-8"));
    table.commit(update);
    update = new BatchUpdate(row);
    update.put(col1, "zebra".getBytes("UTF-8"));
    table.commit(update);

37 HBase - Querying
- Retrieve a cell
    Cell = table.getRow("enclosure1").getColumn("animal:type").getValue();
- Retrieve a row
    RowResult = table.getRow("enclosure1");
- Scan through a range of rows
    Scanner s = table.getScanner(new String[] { "animal:type" });

38 Hive
- Developed at Facebook
- Used for majority of Facebook jobs
- Relational database built on Hadoop
  - Maintains list of table schemas
  - SQL-like query language (HiveQL)
  - Can call Hadoop Streaming scripts from HiveQL
  - Supports table partitioning, clustering, complex data types, some optimizations

39 Hive Table
    CREATE TABLE page_views(viewtime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        ip STRING COMMENT 'User IP address')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE;
- Partitioning breaks the table into separate files for each (dt, country) pair
    Ex: /hive/page_view/dt= ,country=ro
        /hive/page_view/dt= ,country=fr

40 A Simple Query
- Find all page views coming from xyz.com on April 3rd:
    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= ' '
      AND page_views.date <= ' '
      AND page_views.referrer_url like '%xyz.com';
- Hive reads only the matching partition instead of scanning the entire table
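Partition pruning can be sketched by grouping rows under their partition key and skipping the groups that a filter rules out. The functions `partition_rows` and `scan` and the sample values (`day1`, `day2`) are hypothetical; real Hive stores each partition as files in its own directory.

```python
from collections import defaultdict


def partition_rows(rows, keys=("dt", "country")):
    """Group rows by their partition key, one group per (dt, country) pair."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[tuple(row[k] for k in keys)].append(row)
    return dict(partitions)


def scan(partitions, dt=None, country=None):
    """Return matching rows and the number of partitions actually read."""
    rows, partitions_read = [], 0
    for (p_dt, p_country), part in partitions.items():
        if dt is not None and p_dt != dt:
            continue  # pruned: never opened
        if country is not None and p_country != country:
            continue
        partitions_read += 1
        rows.extend(part)
    return rows, partitions_read


views = [
    {"dt": "day1", "country": "ro", "page_url": "/a"},
    {"dt": "day1", "country": "fr", "page_url": "/b"},
    {"dt": "day2", "country": "ro", "page_url": "/c"},
]
parts = partition_rows(views)
rows, read = scan(parts, dt="day1", country="ro")
print(len(parts), read, rows)
```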

41 Aggregation and Joins
- Count users who visited each page by gender:
    SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
    FROM page_views pv JOIN user u ON (pv.userid = u.id)
    WHERE pv.date = ' '
    GROUP BY pv.page_url, u.gender;

42 Hadoop: v1 vs v2
- HADOOP 1.0: Single Use System (Batch Apps)
  - MapReduce (cluster resource management & data processing)
  - HDFS (redundant, reliable storage)
- HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, ...)
  - MapReduce (data processing), Others
  - YARN (cluster resource management)
  - HDFS2 (redundant, highly-available & reliable storage)

43 YARN Architecture
- ResourceManager (RM)
  - Central agent - manages and allocates cluster resources
  - Consists of Scheduler and ApplicationsManager
- NodeManager (NM)
  - Per-Node agent - manages and enforces node resource allocations
- ApplicationMaster (AM)
  - Per-Application - manages application lifecycle and task scheduling

44 YARN: More Than Batch Store ALL DATA in one place Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service

45 Key Improvements in YARN
- Framework supporting multiple applications
  - Separate generic resource brokering from application logic
  - Define protocols/libraries and provide a framework for custom application development
  - Share same Hadoop Cluster across applications
- Cluster Utilization
  - Generic resource container model replaces fixed Map/Reduce slots
  - Container allocations based on locality, memory
  - Sharing cluster among multiple applications
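The generic container model can be sketched as memory bookkeeping per NodeManager. This `ResourceManager` class is a hypothetical toy, not the YARN API: it considers only memory and a single preferred node, and otherwise picks the node with the most free memory.

```python
class ResourceManager:
    """Toy allocator: hands out memory-sized containers instead of fixed slots."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)  # NodeManager id -> free memory in MB

    def allocate(self, memory_mb, preferred=None):
        """Return (node, memory_mb) for a new container, or None if nothing fits."""
        # Try the preferred (data-local) node first, then the emptiest nodes.
        candidates = [preferred] if preferred in self.free else []
        candidates += sorted(self.free, key=self.free.get, reverse=True)
        for node in candidates:
            if self.free[node] >= memory_mb:
                self.free[node] -= memory_mb
                return node, memory_mb
        return None

    def release(self, container):
        node, memory_mb = container
        self.free[node] += memory_mb


rm = ResourceManager({"nm1": 4096, "nm2": 8192})
c1 = rm.allocate(2048, preferred="nm1")  # locality request honoured
c2 = rm.allocate(4096)                   # no preference: most free memory wins
print(c1, c2, rm.free)
```

Because containers are sized per request, any application type can share the same nodes, which is the point of replacing the fixed Map/Reduce slots.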

46 Key Improvements in YARN
- Multi-tenancy: allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard for batch, interactive and real-time engines that can simultaneously access the same data set.
- Scalability: data center processing power continues to rapidly expand. YARN's ResourceManager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data.
- Compatibility: existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to existing processes that already work.

47 YARN Eco-system
- Applications Powered by YARN
  - Apache Giraph - Graph Processing
  - Apache Hama - BSP
  - Apache Hadoop MapReduce - Batch
  - Apache Tez - Batch/Interactive
  - Apache S4 - Stream Processing
  - Apache Samza - Stream Processing
  - Apache Storm - Stream Processing
  - Apache Spark - Iterative applications
  - Elastic Search - Scalable Search
  - Cloudera Llama - Impala on YARN
  - DataTorrent - Data Analysis
  - HOYA - HBase on YARN
- Frameworks Powered By YARN
  - Apache Twill
  - REEF by Microsoft
  - Spring support for Hadoop 2
- There's an app for that... YARN App Marketplace!

48 YARN Application Lifecycle [figure: Application Client (YarnClient, App Specific API), Application Client Protocol, Resource Manager, Application Master Protocol, Application Master (AMRMClient, NMClient), Container Management Protocol, NodeManager, App Container]
