Hadoop: The Definitive Guide

Similar documents
Expert Lecture plan proposal Hadoop& itsapplication

Hadoop & Big Data Analytics Complete Practical & Real-time Training

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Introduction to BigData, Hadoop:-

Big Data Hadoop Course Content

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Big Data Analytics using Apache Hadoop and Spark with Scala

Innovatus Technologies

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Hadoop Development Introduction

Hadoop. Introduction to BIGDATA and HADOOP

Hadoop Online Training

Hadoop course content

Big Data Hadoop Stack

Techno Expert Solutions An institute for specialized studies!

Big Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)

Certified Big Data and Hadoop Course Curriculum

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Hadoop An Overview. - Socrates CCDH

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Certified Big Data Hadoop and Spark Scala Course Curriculum

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info

Data Analytics Job Guarantee Program

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Hortonworks Data Platform

Hadoop. Introduction / Overview

BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG

1 Big Data Hadoop. 1. Introduction About this Course About Big Data Course Logistics Introductions

Big Data Development HADOOP Training - Workshop. FEB 12 to (5 days) 9 am to 5 pm HOTEL DUBAI GRAND DUBAI

50 Must Read Hadoop Interview Questions & Answers

A complete Hadoop Development Training Program.

Hadoop-PR Hortonworks Certified Apache Hadoop 2.0 Developer (Pig and Hive Developer)

Hortonworks HDPCD. Hortonworks Data Platform Certified Developer. Download Full Version :

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Hadoop. copyright 2011 Trainologic LTD

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Microsoft Big Data and Hadoop

Hadoop ecosystem. Nikos Parlavantzas

Configuring and Deploying Hadoop Cluster Deployment Templates

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Top 25 Hadoop Admin Interview Questions and Answers

1Z Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop

Pig A language for data processing in Hadoop

ExamTorrent. Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

Big Data Architect.

Apache Hive for Oracle DBAs. Luís Marques

Hortonworks PR PowerCenter Data Integration 9.x Administrator Specialist.

Exam Questions

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Prototyping Data Intensive Apps: TrendingTopics.org

microsoft

Department of Digital Systems. Digital Communications and Networks. Master Thesis

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Timeline Dec 2004: Dean/Ghemawat (Google) MapReduce paper 2005: Doug Cutting and Mike Cafarella (Yahoo) create Hadoop, at first only to extend Nutch (

BIG DATA COURSE CONTENT

South Asian Journal of Engineering and Technology Vol.2, No.50 (2016) 5 10

Going beyond MapReduce

Big Data Hadoop Certification Training

Oracle Big Data Fundamentals Ed 2

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

KillTest *KIJGT 3WCNKV[ $GVVGT 5GTXKEG Q&A NZZV ]]] QORRZKYZ IUS =K ULLKX LXKK [VJGZK YKX\OIK LUX UTK _KGX

Data Science Training

itpass4sure Helps you pass the actual test with valid and latest training material.

Hortonworks University. Education Catalog 2018 Q1

MapR Enterprise Hadoop

Oracle Big Data Fundamentals Ed 1

Lecture 7 (03/12, 03/14): Hive and Impala Decisions, Operations & Information Technologies Robert H. Smith School of Business Spring, 2018

Hive SQL over Hadoop

Vendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

TIE Data-intensive Programming. Dr. Timo Aaltonen Department of Pervasive Computing

Hortonworks Certified Developer (HDPCD Exam) Training Program

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

April Copyright 2013 Cloudera Inc. All rights reserved.

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism


PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

A brief history on Hadoop

Cloudera Introduction

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

Big Data Analytics. Description:

HADOOP. K.Nagaraju B.Tech Student, Department of CSE, Sphoorthy Engineering College, Nadergul (Vill.), Sagar Road, Saroonagar (Mdl), R.R Dist.T.S.

A Survey on Big Data

Question: 1 You need to place the results of a PigLatin script into an HDFS output directory. What is the correct syntax in Apache Pig?

Cloudera Introduction

Map Reduce & Hadoop Recommended Text:

Integrating Big Data with Oracle Data Integrator 12c ( )

Transcription:

THIRD EDITION Hadoop: The Definitive Guide Tom White Q'REILLY Beijing Cambridge Farnham Köln Sebastopol Tokyo

labte of Contents Foreword Preface xv xvii 1. Meet Hadoop 1 Daw! 1 Data Storage and Analysis 3 Comparison with Other Systems 4 Rational Database Management System 4 Grid Computing 6 Volunteer Computing 8 ABrief History uf Hadoop 9 Apache Hadoop and the Hadoop Ecosystem 12 Hadoop Releases 13 What's Covered in This Book 15 Compatibility 15 2. MapReduce...................................... 17 A Weather Dataset Dara Format Analyzing the Data with Unix Tools Analyzing the Data with Hadoop Map and Reduce Java MapReduce Scaling Out Data Flow Combiner Functions Running a Distributed MapReduce Job Hadoop Streaming Ruby Python 17 17 19 20 20 22 30 30 33 36 36 36 39 v

Hadoop Pipes Compiling and Running 40 41 3. The Hadoop Distributed Filesystem 43 The Design of HDFS HDFS Concepts Blocks Namenodes and Datanodes HDFS Federation HDFS High-Availability The Command-Line Interface Basic Filesystem Operations Hadoop Filesystems Interfaces The Java Interface Reading Data from a Hadoop URL Reading Data Using the FileSystem API Writing Data Directories Querying the Filesystem Deleting Data Data Flow Anatorny of a File Read Anatomy of a File Write Coherency Model Data Ingest with Flume and Sqoop Parallel Copying with distcp Keeping an HDFS Cluster Balanced Hadoop Archives Using Hadoop Archives Limitations 43 45 45 46 47 48 49 50 52 53 55 55 57 60 62 62 67 67 67 70 72 74 75 76 77 77 79 4. Hadoop 1/0 81 Data Integrity 81 Data Integrity in HDFS 81 LocalFileSystem 82 ChecksumFileSystem 83 Compression 83 Codecs 85 Compression and Input Splits 89 Using Compression in MapReduce 90 Serialization 93 The Writable Interface 94 vi I labre ofcontents

Writable Classes Implementing a Custom Writable Serialization Frameworks Avro Avro Data Types and Schemas In-Memory Serialization and Deserialization Avro Datafiles Interoperability Schema Resolution Sort Order Avro MapReduce Sorting Using Avro MapReduce Avro MapReduce in Other Languages File-Based Data Structures SequenceFile MapFile 96 103 108 110 111 114 117 118 121 123 124 128 130 130 130 137 5. Developing amapreduce Application 143 The Configuration API Combining Resources Variable Expansion Setting Up the Development Environment Managing Configuration GenericOptionsParser, Tool, and ToolRunner Writing a Unit Test with MRUnit Mapper Reducer Running Locally on Test Data Running a job in a Local job Runner Testing the Driver Running on a Cluster Packaging a job 144 145 ]46 146 148 150 154 154 156 157 157 160 161 162 Launehing a job 163 The MapReduce Web ur ] 65 Retrieving the Results Debugging a job Hadoop Logs Rernote Debugging Tuning a job Profiling Tasks MapReduce Workflows Decomposing a Problem into MapReduce jobs jobcontrol 168 ]70 ]75 177 178 179 181 181 183 TableofContents I vii

Apache Oozie 183 6. How MapReduce Works 189 Anatomy of a MapReduce Job Run 189 Classic MapReduce (MapReduce 1) 190 YARN (MapReduce 2) 196 Failures 202 Failures in Classic MapReduce 202 Failures in YARN 204 Job Scheduling 206 The Fair Scheduler 207 The Capacity Scheduler 207 Shuffle and Sort 208 The Map Side 208 The Reduce Side 210 Configuration Tuning 211 Task Execution 214 The Task Execution Environment 215 Speculative Execution 215 Output Committers 217 Task JVM Reuse 219 Skipping Bad Records 220 7. MapReduce Types and Formats,, 223 MapReduce Types The Default MapReduce Job Input Formats Input Splits and Records Text Input Binary Input Multiple Inputs Database Input (and Output) Output Formats Text Output Binary Output Multiple Outputs Lazy Output Database Output 223 227 234 234 245 249 250 251 251 252 253 253 257 258 8. MapReduce Features 259 Counters 259 Built-in Counters 259 User-Defined Java Counters 264 viii I labte ofcontents

User-Defined Streaming Counters Sorting Preparation Partial Sort Total Sort Secondary Sort Joins Map-Side joins Reduce-Side Joins Side Data Distribution Using the Job Configuration Distributed Cache MapReduce Library Classes 268 268 269 270 274 277 283 284 285 288 288 289 295 9. Setting Up ahadoop Cluster 297 Cluster Specification Network Topology Cluster Setup and Installation Installing Java Creating a Hadoop User Installing Hadoop Testing the Installation SSH Configuration Hadoop Configuration Configuration Management Environment Settings Important Hadoop Daemon Properties Hadoop Daemon Addresses and Ports Other Hadoop Properties User Account Creation YARN Configuration Important YARN Daemon Properties YARN Daemon Addresses and Ports Security Kerberos and Hadoop Delegation Tokens Other Security Enhancements Benchmarking a Hadoop Cluster Hadoop Benchmarks User Jobs Hadoop in the Cloud Apache Whirr 297 299 301 302 302 302 303 303 304 305 307 311 316 317 320 320 321 324 325 326 328 329 331 331 333 334 334 Table ofcontents I ix

10. Administering Hadoop 339 HDFS Persistent Data Structures Safe Mode Audit Logging Tools Monitoring Logging Metries Java Management Extensions Maintenance Routine Administration Procedures Commissioning and Decommissioning Nodes Upgrades 339 339 344 346 347 351 352 352 355 358 358 359 362 11. Pig 367 Installing and Running Pig Execution Types Running Pig Programs Grunt Pig Latin Editors An Example Generating Examples Comparison with Databases Pig Latin Structure Statements Expressions Types Schemas Functions Macros User-Defined Functions A Filter UDF An Eval UDF A Load UDF Data Processing Operators Loading and Storing Data Filtering Data Grouping and Joining Data Sorting Data Combining and Splitting Data Pig in Practice 368 368 370 370 371 371 373 374 375 376 377 381 382 384 388 390 391 391 394 396 399 399 400 402 407 408 409 x I Table ofcontents

Parallelism Parameter Substitution 409 410 12. Hive 413 Installing Hive The Hive Shell An Example Running Hive Configuring Hive Hive Services The Metastore Comparison with Traditional Databases Schema on Read Versus Schema on Write Updates, Transactions, and Indexes HiveQL Data Types Operators and Functions Tables Managed Tables and External Tables Partitions and Buckets Storage Formats Importing Data Altering Tables Dropping Tables Querying Data Sorting and Aggregating MapReduce Scripts ]oins Subqueries Views User-Defined Functions Writing a UDF Writing a UDAF 414 415 416 417 417 419 421 423 423 424 425 426 428 429 429 431 435 441 443 443 444 444 445 446 449 450 451 452 454 13. HBase 459 HBasics 459 Backdrop 460 Concepts 460 Whirlwind Tour of the Data Model 460 Implementation 461 Installation 464 Test Drive 465 Clients 467 Table ofcontents I xi

Java Avro, REST, and Thrift Example Schemas Loading Data Web Queries HBase Versus RDBMS Successful Service HBase Use Case: HBase at Streamy.com Praxis Versions HDFS UI Metries Schema Design Counters Bulk Load 467 470 472 472 473 476 479 480 481 481 483 483 484 485 485 486 486 487 14. ZooKeeper..................... 489 Installing and Running ZooKeeper 490 An Example 492 Group Membership in ZooKeeper 492 Creating the Group 493 Joining a Group 495 Listing Members in a Group 496 Deleting a Group 498 The ZooKeeper Service 499 Data Model 499 Operations 501 Implementation 506 Consistency 507 Sessions 509 States 511 Building Applications with ZooKeeper 512 A Configuration Service 512 The Resilient ZooKeeper Application 515 A Lock Service 519 More Distributed Data Structures and Protocols 521 ZooKeeper in Production 522 Resilience and Performance 523 Configuration 524 xii I Table ofcontents

15. Sqoop 527 Getting Sqoop Sqoop Connectors A Sampie Import Text and Binary File Formats Generated Code Additional Serialization Systems Imports: A Deeper Look Controlling the Import Imports and Consistency Direct-mode Imports Working with Imported Data Imported Data and Hive Importing Large Objects Performing an Export Exports: A Deeper Look Exports and Transactionality Exports and SequenceFiles 527 529 529 532 532 533 533 535 536 536 536 537 540 542 543 545 545 16. Case Studies 547 Hadoop Usage at Last.frn Last.frn: The Social Music Revolution Hadoop at Last.frn Generating Charts with Hadoop The Track Statistics Program Summary Hadoop and Hive at Facebook Hadoop at Facebook Hypothetical Use Case Studies Hive Problems and Future Work Nutch Search Engine Data Structures Selected Examples of Hadoop Data Processing in Nutch Summary Log Processing at Rackspace Requirements/The Problem Brief History Choosing Hadoop Collection and Storage MapReduce for Logs Cascading Fields, Tuples, and Pipes 547 547 547 548 549 556 556 556 559 562 566 567 568 571 580 581 581 582 582 582 583 589 590 Table ofcontents I xiii

Operations 593 Taps, Sehemes, and Flows 594 Caseading in Praetiee 595 Flexibility 598 Hadoop and Caseading at ShareThis 599 S~m~ TeraByte Sort on Apache Hadoop 603 Using Pig and Wukong to Explore Billion-edge Network Graphs 607 Measuring Cornmunity 609 Everybody's Talkin' at Me: The Twitter Reply Graph 609 Symmetrie Links 612 Community Extraetion 613 A. Installing Apache Hadoop.............................................. 617 B. Cloudera's Distribution Including Apache Hadoop 623 C. Preparing the NCDC Weather Data 625 Index 629 6m xiv I Table ofcontents