Introduction to Hadoop. Owen O'Malley, Yahoo! Grid Team

Introduction to Hadoop. Owen O'Malley, Yahoo! Grid Team. owen@yahoo-inc.com

Who Am I? Yahoo! architect on Hadoop Map/Reduce: I design, review, and implement features in Hadoop, and have worked on Hadoop full time since Feb 2006. Before the Grid team, I worked on Yahoo!'s WebMap. Apache VP for Hadoop and chair of the Hadoop Project Management Committee, responsible for building the Hadoop community and interfacing between the Hadoop PMC and the Apache Board.

Problem: how do you scale up applications? With 100s of terabytes of data, it takes 11 days just to read the data on one computer. You need lots of cheap computers; that fixes the speed problem (15 minutes on 1000 computers), but introduces reliability problems: in large clusters, computers fail every day, and cluster size is not fixed. You need common infrastructure that is efficient and reliable.
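The slide's figures can be sanity-checked with simple arithmetic; a minimal sketch, assuming 100 TB of data and a ~105 MB/s sequential disk read rate (both assumptions, chosen to be consistent with the numbers above):

```python
TB = 10**12
data_bytes = 100 * TB            # "100s of terabytes" -- assume 100 TB
read_rate = 105 * 10**6          # ~105 MB/s sequential disk read (assumed)

one_machine_s = data_bytes / read_rate     # seconds to read on one computer
days = one_machine_s / 86400               # ~11 days
minutes_1000 = one_machine_s / 1000 / 60   # ~16 minutes spread over 1000 machines
print(round(days), round(minutes_1000))
```

The point is that the read time is disk-bandwidth bound, so it divides almost perfectly across machines.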

Solution: an open-source Apache project. Hadoop Core includes a Distributed File System, which distributes the data, and Map/Reduce, which distributes the application. It is written in Java and runs on Linux, Mac OS X, Windows, and Solaris, on commodity hardware.

Commodity Hardware Cluster: typically a 2-level architecture. Nodes are commodity Linux PCs, 40 nodes per rack; the uplink from each rack is 8 gigabit, and rack-internal bandwidth is 1 gigabit all-to-all.

Distributed File System: a single petabyte-scale file system for the entire cluster, managed by a single namenode. Files can be written, read, renamed, and deleted, but writes are append-only. It is optimized for streaming reads of large files. Files are broken into large blocks, typically 128 MB, transparently to the client, and blocks are replicated to several datanodes for reliability. Clients talk to both the namenode and the datanodes, but data is never sent through the namenode, so file system throughput scales nearly linearly. Access is from Java, C, or the command line.
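The block accounting above is easy to make concrete; a sketch assuming a hypothetical 1 TiB file, 128 MiB blocks, and the default 3 replicas (the file size is invented for illustration):

```python
import math

BLOCK = 128 * 1024**2             # 128 MiB block size
REPLICAS = 3                      # default replication factor
file_size = 1024**4               # hypothetical 1 TiB file

blocks = math.ceil(file_size / BLOCK)     # blocks the namenode must track
raw_storage = blocks * BLOCK * REPLICAS   # bytes actually held across datanodes
print(blocks, raw_storage // 1024**4)     # 8192 blocks, 3 TiB raw
```

Large blocks keep the namenode's per-file metadata small while the datanodes carry the replicated bulk.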

Block Placement: the default is 3 replicas, but it is settable. Blocks are placed (and writes pipelined) on the same node as the writer, on a node on a different rack, and on a second node on that other rack. Clients read from the closest replica. If the replication of a block drops below its target, it is automatically re-replicated.

HDFS Dataflow

Data Correctness: data is checked with CRC32. On file creation, the client computes a checksum per 512 bytes, and the datanode stores the checksums. On file access, the client retrieves the data and checksums from the datanode; if validation fails, the client tries other replicas. Datanodes also validate their blocks periodically.
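The scheme can be sketched with Python's zlib.crc32; the 512-byte chunking matches the slide, while the payload and the corruption are made up for illustration:

```python
import zlib

CHUNK = 512  # one CRC32 per 512 bytes of data

def checksums(data: bytes):
    """What the client computes at file-creation time."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, sums):
    """What the client re-checks on read; on failure it tries another replica."""
    return checksums(data) == sums

payload = b"x" * 1500
sums = checksums(payload)
assert verify(payload, sums)
corrupt = payload[:-1] + b"y"        # flip the last byte
assert not verify(corrupt, sums)     # mismatch detected in the final chunk
```

Per-chunk checksums localize the damage: only the failing chunk needs to be re-read from another replica.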

Map/Reduce: Map/Reduce is a programming model for efficient distributed computing. It works like a Unix pipeline: cat input | grep | sort | uniq -c | cat > output corresponds to Input | Map | Shuffle & Sort | Reduce | Output. Efficiency comes from streaming through the data (reducing seeks) and from pipelining. It is a good fit for a lot of applications: log processing, web index building, data mining and machine learning.
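The pipeline analogy can be sketched as a tiny in-memory word count, with map, shuffle-and-sort, and reduce playing the roles of grep, sort, and uniq -c (the phase names and data are illustrative, not Hadoop's API):

```python
from itertools import groupby

def map_phase(lines):
    """Map: emit a (word, 1) pair per token, like grep picking out records."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    """Shuffle & sort: bring equal keys together, like sort."""
    return groupby(sorted(pairs), key=lambda kv: kv[0])

def reduce_phase(grouped):
    """Reduce: sum the counts per key, like uniq -c."""
    for key, group in grouped:
        yield (key, sum(count for _, count in group))

counts = dict(reduce_phase(shuffle_sort(map_phase(["a b a", "b a"]))))
print(counts)  # {'a': 3, 'b': 2}
```

In real Hadoop the same three phases run in parallel across machines, with the shuffle moving map output over the network to the reducers.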

Map/Reduce Dataflow

Map/Reduce features: Java, C++, and text-based APIs. In Java you use objects; in C++, bytes. The text-based (streaming) API is great for scripting or legacy apps, and there are higher-level interfaces: Pig, Hive, Jaql. Automatic re-execution on failure: in a large cluster, some nodes are always slow or flaky, and the framework re-executes failed tasks. Locality optimizations: with large data, bandwidth to the data is a problem, so Map/Reduce queries HDFS for the locations of the input data and schedules map tasks close to their inputs when possible.
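The streaming API's contract is just tab-separated key/value lines on stdin and stdout; a minimal word-count sketch of a mapper and reducer pair (the function names are mine, and in a real job each would be a standalone script handed to the streaming jar):

```python
def mapper(stdin, stdout):
    """Emit one 'word<TAB>1' line per token."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    """Input arrives sorted by key; sum counts over runs of equal keys."""
    current, total = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")
```

The framework handles the sort between the two, which is why the reducer only needs to watch for the key changing.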

Why Yahoo! is investing in Hadoop: we started by building better applications: scale up web-scale batch applications (search, ads, ), factor out common code from existing systems so new applications are easier to write, and manage our many clusters more easily. The mission now includes research support: build a huge data warehouse with many Yahoo! data sets, couple it with a huge compute cluster and programming models that make using the data easy, and provide this as a service to our researchers. We are seeing great results: experiments can be run much more quickly in this environment.

Hadoop Timeline: 2004, HDFS and map/reduce started in Nutch. Dec 2005, Nutch ported to map/reduce. Jan 2006, Doug Cutting joins Yahoo!. Feb 2006, Hadoop factored out of Nutch. Apr 2006, sorts 1.9 TB on 188 nodes in 47 hours. May 2006, Yahoo! sets up a research cluster. Jan 2008, Hadoop becomes a top-level Apache project. Feb 2008, Yahoo! creating its WebMap with Hadoop. Apr 2008, wins the terabyte sort benchmark. Aug 2008, ran a 4000-node Hadoop cluster.

Running the Production WebMap: search needs a graph of the known web, to invert edges, compute link text, and run whole-graph heuristics. It is a periodic batch job using Map/Reduce, built as a chain of ~100 map/reduce jobs. Scale: 100 billion nodes and 1 trillion edges; the largest shuffle is 450 TB; the final output is 300 TB compressed; it runs on 10,000 cores. Written mostly using Hadoop's C++ interface.

Inverting the WebMap: one job inverts all of the edges in the WebMap, finding the text in the links that point to each page. The input is a cache of web pages (key: URL; value: page HTML). The map output is key: target URL; value: (source URL, text from the link). The reduce adds a column to the URL table with the set of linking pages and their link text.
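The inversion fits in a few lines; a sketch in which the function names and tiny link data are invented for illustration, and the links are assumed to be already parsed out of the page HTML:

```python
from collections import defaultdict

def map_invert(url, links):
    """links: (target_url, anchor_text) pairs parsed from this page."""
    for target, text in links:
        yield (target, (url, text))

def reduce_invert(pairs):
    """Group by target URL: the new column of (linking page, link text) sets."""
    table = defaultdict(list)
    for target, source in sorted(pairs):
        table[target].append(source)
    return dict(table)

pairs = list(map_invert("a.example", [("b.example", "see B")]))
pairs += list(map_invert("c.example", [("b.example", "B is here")]))
inverted = reduce_invert(pairs)
print(inverted["b.example"])
```

Keying map output by the link target is the whole trick: the shuffle then delivers every inbound link for a page to the same reducer.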

Finding Duplicate Web Pages: find close textual matches in the corpus of web pages, using a loose hash based on the visible text. The input is the same cache of web pages (key: URL; value: page HTML). The map output is key: (page text hash, goodness); value: URL. The reduce output is key: page text hash; value: URLs, best first. A second job re-sorts based on the URL: its map output is (URL, best URL), and its reduce creates a new column in the URL table with the best URL for duplicate pages.
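The first job's shape can be sketched as follows; the "loose" hash here (strip tags and whitespace, then hash) is an assumption standing in for whatever Yahoo! actually used, and the example pages and goodness scores are invented:

```python
import hashlib
import re

def loose_hash(html):
    """Assumed loose hash: drop tags and whitespace, hash the visible text."""
    visible = re.sub(r"<[^>]+>|\s+", "", html).lower()
    return hashlib.md5(visible.encode()).hexdigest()

def map_dup(url, html, goodness):
    yield (loose_hash(html), (goodness, url))

def reduce_dup(records):
    """Keep the highest-goodness URL per text hash ('best first')."""
    best = {}
    for h, (goodness, url) in records:
        if h not in best or goodness > best[h][0]:
            best[h] = (goodness, url)
    return {h: url for h, (_, url) in best.items()}

records = list(map_dup("a.example", "<p>Hello World</p>", 2))
records += list(map_dup("b.example", "<b>Hello   World</b>", 5))
print(reduce_dup(records))  # both pages hash alike; b.example wins
```

Because the hash ignores markup and spacing, near-identical pages collide on the same reduce key, where the best copy is chosen.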

Research Clusters: the Grid team runs research clusters as a service to Yahoo! researchers, analytics as a service. The jobs are mostly data mining and machine learning, and most research jobs are *not* Java: 42% streaming (using Unix text processing to define map and reduce), 28% Pig (a higher-level dataflow scripting language), 28% Java, and 2% C++.

Hadoop clusters We have ~20,000 machines running Hadoop Our largest clusters are currently 2000 nodes Several petabytes of user data (compressed, unreplicated) We run hundreds of thousands of jobs every month

Research Cluster Usage

NY Times: needed offline conversion of public-domain articles from 1851-1922. Used Hadoop to convert scanned images to PDF, running 100 Amazon EC2 instances for around 24 hours: 4 TB of input, 1.5 TB of output. (Sample image: "Published 1892, copyright New York Times.")

Terabyte Sort Benchmark: started by Jim Gray at Microsoft in 1998; the task is sorting 10 billion 100-byte records. Hadoop won the general category in 209 seconds (the previous record was 297 seconds) on 910 nodes: 2 quad-core 2.0 GHz Xeons per node, 4 SATA disks per node, 8 GB RAM per node, gigabit Ethernet per node, an 8-gigabit uplink per rack, and 40 nodes per rack. The only hard parts were getting a total order and converting the data generator to map/reduce. http://developer.yahoo.net/blogs/hadoop/2008/07
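"Getting a total order" means choosing reducer boundaries so that every key in reducer i precedes every key in reducer i+1. A sketch of the sampling-based range-partitioner idea (the key distribution and sizes here are made up):

```python
import bisect
import random

def cutpoints(sample, num_parts):
    """Pick num_parts-1 splitter keys at even quantiles of a sorted sample."""
    s = sorted(sample)
    step = len(s) // num_parts
    return [s[i * step] for i in range(1, num_parts)]

def partition(key, cuts):
    """Reducer index for this key: the number of splitters <= key."""
    return bisect.bisect_right(cuts, key)

random.seed(0)
keys = [random.randrange(10**6) for _ in range(10_000)]
cuts = cutpoints(random.sample(keys, 1_000), 4)
by_part = {}
for k in keys:
    by_part.setdefault(partition(k, cuts), []).append(k)
# sorting each partition and concatenating outputs 0..3 yields a total order
```

Sampling keeps the partitions roughly equal in size, so no single reducer becomes the bottleneck of the sort.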

Hadoop Community: Apache is focused on project communities. Users; contributors, who write patches; committers, who can also commit patches; and the Project Management Committee, which votes on new committers and on releases. Apache is a meritocracy. Use, contribution, and diversity are growing, but we need and want more!

Size of Releases

Size of Developer Community

Size of User Community

Who Uses Hadoop? Amazon/A9 Facebook Google IBM Joost Last.fm New York Times PowerSet (now Microsoft) Quantcast Veoh Yahoo! More at http://wiki.apache.org/hadoop/poweredby

What's Next? 0.19: file appends, a total-order sampler and partitioner, and a pluggable scheduler. 0.20: a new map/reduce API, better scheduling for sharing between groups, and splitting Core into sub-projects (HDFS, Map/Reduce, Hive). And beyond: HDFS and Map/Reduce security, high availability via ZooKeeper, and getting ready for Hadoop 1.0.

Q&A. For more information: Website: http://hadoop.apache.org/core Mailing lists: core-dev@hadoop.apache.org, core-user@hadoop.apache.org IRC: #hadoop on irc.freenode.org