THIRD EDITION Hadoop: The Definitive Guide Tom White O'REILLY Beijing Cambridge Farnham Köln Sebastopol Tokyo
Table of Contents

Foreword xv
Preface xvii

1. Meet Hadoop 1
  Data! 1
  Data Storage and Analysis 3
  Comparison with Other Systems 4
  Relational Database Management System 4
  Grid Computing 6
  Volunteer Computing 8
  A Brief History of Hadoop 9
  Apache Hadoop and the Hadoop Ecosystem 12
  Hadoop Releases 13
  What's Covered in This Book 15
  Compatibility 15

2. MapReduce 17
  A Weather Dataset 17
  Data Format 17
  Analyzing the Data with Unix Tools 19
  Analyzing the Data with Hadoop 20
  Map and Reduce 20
  Java MapReduce 22
  Scaling Out 30
  Data Flow 30
  Combiner Functions 33
  Running a Distributed MapReduce Job 36
  Hadoop Streaming 36
  Ruby 36
  Python 39
  Hadoop Pipes 40
  Compiling and Running 41

3. The Hadoop Distributed Filesystem 43
  The Design of HDFS 43
  HDFS Concepts 45
  Blocks 45
  Namenodes and Datanodes 46
  HDFS Federation 47
  HDFS High-Availability 48
  The Command-Line Interface 49
  Basic Filesystem Operations 50
  Hadoop Filesystems 52
  Interfaces 53
  The Java Interface 55
  Reading Data from a Hadoop URL 55
  Reading Data Using the FileSystem API 57
  Writing Data 60
  Directories 62
  Querying the Filesystem 62
  Deleting Data 67
  Data Flow 67
  Anatomy of a File Read 67
  Anatomy of a File Write 70
  Coherency Model 72
  Data Ingest with Flume and Sqoop 74
  Parallel Copying with distcp 75
  Keeping an HDFS Cluster Balanced 76
  Hadoop Archives 77
  Using Hadoop Archives 77
  Limitations 79

4. Hadoop I/O 81
  Data Integrity 81
  Data Integrity in HDFS 81
  LocalFileSystem 82
  ChecksumFileSystem 83
  Compression 83
  Codecs 85
  Compression and Input Splits 89
  Using Compression in MapReduce 90
  Serialization 93
  The Writable Interface 94
  Writable Classes 96
  Implementing a Custom Writable 103
  Serialization Frameworks 108
  Avro 110
  Avro Data Types and Schemas 111
  In-Memory Serialization and Deserialization 114
  Avro Datafiles 117
  Interoperability 118
  Schema Resolution 121
  Sort Order 123
  Avro MapReduce 124
  Sorting Using Avro MapReduce 128
  Avro MapReduce in Other Languages 130
  File-Based Data Structures 130
  SequenceFile 130
  MapFile 137

5. Developing a MapReduce Application 143
  The Configuration API 144
  Combining Resources 145
  Variable Expansion 146
  Setting Up the Development Environment 146
  Managing Configuration 148
  GenericOptionsParser, Tool, and ToolRunner 150
  Writing a Unit Test with MRUnit 154
  Mapper 154
  Reducer 156
  Running Locally on Test Data 157
  Running a Job in a Local Job Runner 157
  Testing the Driver 160
  Running on a Cluster 161
  Packaging a Job 162
  Launching a Job 163
  The MapReduce Web UI 165
  Retrieving the Results 168
  Debugging a Job 170
  Hadoop Logs 175
  Remote Debugging 177
  Tuning a Job 178
  Profiling Tasks 179
  MapReduce Workflows 181
  Decomposing a Problem into MapReduce Jobs 181
  JobControl 183
  Apache Oozie 183

6. How MapReduce Works 189
  Anatomy of a MapReduce Job Run 189
  Classic MapReduce (MapReduce 1) 190
  YARN (MapReduce 2) 196
  Failures 202
  Failures in Classic MapReduce 202
  Failures in YARN 204
  Job Scheduling 206
  The Fair Scheduler 207
  The Capacity Scheduler 207
  Shuffle and Sort 208
  The Map Side 208
  The Reduce Side 210
  Configuration Tuning 211
  Task Execution 214
  The Task Execution Environment 215
  Speculative Execution 215
  Output Committers 217
  Task JVM Reuse 219
  Skipping Bad Records 220

7. MapReduce Types and Formats 223
  MapReduce Types 223
  The Default MapReduce Job 227
  Input Formats 234
  Input Splits and Records 234
  Text Input 245
  Binary Input 249
  Multiple Inputs 250
  Database Input (and Output) 251
  Output Formats 251
  Text Output 252
  Binary Output 253
  Multiple Outputs 253
  Lazy Output 257
  Database Output 258

8. MapReduce Features 259
  Counters 259
  Built-in Counters 259
  User-Defined Java Counters 264
  User-Defined Streaming Counters 268
  Sorting 268
  Preparation 269
  Partial Sort 270
  Total Sort 274
  Secondary Sort 277
  Joins 283
  Map-Side Joins 284
  Reduce-Side Joins 285
  Side Data Distribution 288
  Using the Job Configuration 288
  Distributed Cache 289
  MapReduce Library Classes 295

9. Setting Up a Hadoop Cluster 297
  Cluster Specification 297
  Network Topology 299
  Cluster Setup and Installation 301
  Installing Java 302
  Creating a Hadoop User 302
  Installing Hadoop 302
  Testing the Installation 303
  SSH Configuration 303
  Hadoop Configuration 304
  Configuration Management 305
  Environment Settings 307
  Important Hadoop Daemon Properties 311
  Hadoop Daemon Addresses and Ports 316
  Other Hadoop Properties 317
  User Account Creation 320
  YARN Configuration 320
  Important YARN Daemon Properties 321
  YARN Daemon Addresses and Ports 324
  Security 325
  Kerberos and Hadoop 326
  Delegation Tokens 328
  Other Security Enhancements 329
  Benchmarking a Hadoop Cluster 331
  Hadoop Benchmarks 331
  User Jobs 333
  Hadoop in the Cloud 334
  Apache Whirr 334
10. Administering Hadoop 339
  HDFS 339
  Persistent Data Structures 339
  Safe Mode 344
  Audit Logging 346
  Tools 347
  Monitoring 351
  Logging 352
  Metrics 352
  Java Management Extensions 355
  Maintenance 358
  Routine Administration Procedures 358
  Commissioning and Decommissioning Nodes 359
  Upgrades 362

11. Pig 367
  Installing and Running Pig 368
  Execution Types 368
  Running Pig Programs 370
  Grunt 370
  Pig Latin Editors 371
  An Example 371
  Generating Examples 373
  Comparison with Databases 374
  Pig Latin 375
  Structure 376
  Statements 377
  Expressions 381
  Types 382
  Schemas 384
  Functions 388
  Macros 390
  User-Defined Functions 391
  A Filter UDF 391
  An Eval UDF 394
  A Load UDF 396
  Data Processing Operators 399
  Loading and Storing Data 399
  Filtering Data 400
  Grouping and Joining Data 402
  Sorting Data 407
  Combining and Splitting Data 408
  Pig in Practice 409
  Parallelism 409
  Parameter Substitution 410

12. Hive 413
  Installing Hive 414
  The Hive Shell 415
  An Example 416
  Running Hive 417
  Configuring Hive 417
  Hive Services 419
  The Metastore 421
  Comparison with Traditional Databases 423
  Schema on Read Versus Schema on Write 423
  Updates, Transactions, and Indexes 424
  HiveQL 425
  Data Types 426
  Operators and Functions 428
  Tables 429
  Managed Tables and External Tables 429
  Partitions and Buckets 431
  Storage Formats 435
  Importing Data 441
  Altering Tables 443
  Dropping Tables 443
  Querying Data 444
  Sorting and Aggregating 444
  MapReduce Scripts 445
  Joins 446
  Subqueries 449
  Views 450
  User-Defined Functions 451
  Writing a UDF 452
  Writing a UDAF 454

13. HBase 459
  HBasics 459
  Backdrop 460
  Concepts 460
  Whirlwind Tour of the Data Model 460
  Implementation 461
  Installation 464
  Test Drive 465
  Clients 467
  Java 467
  Avro, REST, and Thrift 470
  Example 472
  Schemas 472
  Loading Data 473
  Web Queries 476
  HBase Versus RDBMS 479
  Successful Service 480
  HBase 481
  Use Case: HBase at Streamy.com 481
  Praxis 483
  Versions 483
  HDFS 484
  UI 485
  Metrics 485
  Schema Design 486
  Counters 486
  Bulk Load 487

14. ZooKeeper 489
  Installing and Running ZooKeeper 490
  An Example 492
  Group Membership in ZooKeeper 492
  Creating the Group 493
  Joining a Group 495
  Listing Members in a Group 496
  Deleting a Group 498
  The ZooKeeper Service 499
  Data Model 499
  Operations 501
  Implementation 506
  Consistency 507
  Sessions 509
  States 511
  Building Applications with ZooKeeper 512
  A Configuration Service 512
  The Resilient ZooKeeper Application 515
  A Lock Service 519
  More Distributed Data Structures and Protocols 521
  ZooKeeper in Production 522
  Resilience and Performance 523
  Configuration 524
15. Sqoop 527
  Getting Sqoop 527
  Sqoop Connectors 529
  A Sample Import 529
  Text and Binary File Formats 532
  Generated Code 532
  Additional Serialization Systems 533
  Imports: A Deeper Look 533
  Controlling the Import 535
  Imports and Consistency 536
  Direct-mode Imports 536
  Working with Imported Data 536
  Imported Data and Hive 537
  Importing Large Objects 540
  Performing an Export 542
  Exports: A Deeper Look 543
  Exports and Transactionality 545
  Exports and SequenceFiles 545

16. Case Studies 547
  Hadoop Usage at Last.fm 547
  Last.fm: The Social Music Revolution 547
  Hadoop at Last.fm 547
  Generating Charts with Hadoop 548
  The Track Statistics Program 549
  Summary 556
  Hadoop and Hive at Facebook 556
  Hadoop at Facebook 556
  Hypothetical Use Case Studies 559
  Hive 562
  Problems and Future Work 566
  Nutch Search Engine 567
  Data Structures 568
  Selected Examples of Hadoop Data Processing in Nutch 571
  Summary 580
  Log Processing at Rackspace 581
  Requirements/The Problem 581
  Brief History 582
  Choosing Hadoop 582
  Collection and Storage 582
  MapReduce for Logs 583
  Cascading 589
  Fields, Tuples, and Pipes 590
  Operations 593
  Taps, Schemes, and Flows 594
  Cascading in Practice 595
  Flexibility 598
  Hadoop and Cascading at ShareThis 599
  Summary
  TeraByte Sort on Apache Hadoop 603
  Using Pig and Wukong to Explore Billion-edge Network Graphs 607
  Measuring Community 609
  Everybody's Talkin' at Me: The Twitter Reply Graph 609
  Symmetric Links 612
  Community Extraction 613

A. Installing Apache Hadoop 617
B. Cloudera's Distribution Including Apache Hadoop 623
C. Preparing the NCDC Weather Data 625

Index 629