Hadoop Overview Lars George Director EMEA Services
About Me Director EMEA Services @ Cloudera Consulting on Hadoop projects (everywhere) Apache Committer on HBase and Whirr O'Reilly Author of "HBase: The Definitive Guide" (now in Japanese!) Contact: lars@cloudera.com @larsgeorge
Agenda Part 1: Why Hadoop? Part 2: Hadoop in the Enterprise Infrastructure Part 3: What is Hadoop? Part 4: Use-Cases
Why Hadoop? Part 1
The Progression to Big Data (THEN → NOW): VOLUME: GB → PB; VARIETY: Structured → Structured + Unstructured; VELOCITY: Trickle → Torrent; VALUE: Operational Reporting → Reporting + Data Discovery
Pain Points: Data Management Can't ingest fast enough Costs too much to store Exists in different places Archived data is lost
Pain Points: Data Exploration & Analysis Analysis and processing take too long Data exists in silos Can't ask new questions Can't analyze unstructured data
Apache Hadoop A Revolutionary Platform for Big Data INGEST STORE EXPLORE PROCESS ANALYZE SERVE VOLUME: Distributed architecture scales cost-effectively VARIETY: Store data in any format VELOCITY: Load raw data and define how you look at it later VALUE: Process data faster, ask any question
Hadoop and Relational Databases
Schema-on-Write (RDBMS): Schema must be created before any data can be loaded. An explicit load operation has to take place which transforms the data to the DB-internal structure. New columns must be added explicitly before new data for such columns can be loaded into the database. PROS: 1) Reads are fast 2) Standards and governance
Schema-on-Read (Hadoop): Data is simply copied to the file store; no transformation is needed. A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding). New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it. PROS: 1) Loads are fast 2) Flexibility and agility
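The schema-on-read idea can be sketched in a few lines of Python. This is a toy illustration, not Hive's actual SerDe API: raw lines are stored untouched, a parser is applied only at read time, and updating the parser retroactively changes how already-stored data appears.

```python
# Schema-on-read sketch: store raw lines untouched, parse only when reading.
raw_store = [
    "2013-01-15 login alice",
    "2013-01-16 purchase bob",
]

def read_with_serde(store, serde):
    """Apply a serializer/deserializer-style parser at read time (late binding)."""
    return [serde(line) for line in store]

# First "schema": only date and event columns are extracted.
serde_v1 = lambda line: dict(zip(["date", "event"], line.split()[:2]))

# Later the SerDe is updated to expose a user column -- no reload is needed,
# and old data retroactively shows the new column.
serde_v2 = lambda line: dict(zip(["date", "event", "user"], line.split()[:3]))

rows_v1 = read_with_serde(raw_store, serde_v1)
rows_v2 = read_with_serde(raw_store, serde_v2)
```

Contrast this with schema-on-write, where the second column layout would have required an explicit ALTER and reload before any "user" data could land.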
Hadoop and Relational Databases: you need both. Relational databases are best used for: canonical structured data; interactive OLAP analytics (<1 sec); multistep ACID transactions; 100% SQL compliance. Hadoop is best used for: structured or unstructured data (flexibility); exploratory analysis (1 sec-5 min); scalability of storage/compute; complex data processing.
Hadoop in the Enterprise Infrastructure Part 2
Cloudera's Vision for Hadoop LEGACY: multiple platforms (complex, fragmented, costly) NEW: a single data platform (simplified, unified, efficient)
Hadoop in the Enterprise OPERATORS DATA ARCHITECTS ENGINEERS DATA SCIENTISTS ANALYSTS BUSINESS USERS Management Tools Metadata / ETL Tools Developer Tools Data Modeling BI / Analytics Enterprise Reporting Hadoop Platform Enterprise Data Warehouse Data Serving Systems Logs Files Web Data Relational Databases Web / Mobile Applications CUSTOMERS
What Is Hadoop? Part 3
The Origins of Hadoop Source: Credit Suisse
Core Hadoop The Basics
What is Apache Hadoop? Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed. CORE HADOOP SYSTEM COMPONENTS: Hadoop Distributed File System (HDFS), self-healing, high-bandwidth clustered storage; MapReduce/YARN + MRv2, distributed computing framework. Works with every type of data Brings computation to storage Changes the economics of data management
Hadoop Components Hadoop consists of two core components The Hadoop Distributed File System (HDFS) Distributed Processing Framework (MapReduce etc.) There are many other projects based around core Hadoop Often referred to as the Hadoop Ecosystem Pig, Hive, HBase, Flume, Oozie, Sqoop, etc. More on this later A set of machines running HDFS and MapReduce is known as a Hadoop Cluster Individual machines are known as nodes A cluster can have as few as one node, as many as several thousand More nodes = more capacity & better performance!
Core Hadoop Concepts Data is spread among machines in advance Computation happens where the data is stored, wherever possible Data is replicated multiple times on the system for increased availability and reliability Nodes talk to each other as little as possible Shared nothing architecture The system (vs. developers/applications) handles communication between nodes Applications are written in high-level code Developers do not worry about network programming, temporal dependencies, etc. Applications can be written in virtually any programming language
Hadoop Components: HDFS HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster Data files are split into blocks and distributed across multiple nodes in the cluster Each block is replicated multiple times Default is to replicate each block three times Replicas are stored on different nodes This ensures both reliability and availability
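The splitting and replication behavior described above can be illustrated with a toy Python sketch. The round-robin placement and node names here are purely illustrative: real HDFS uses a rack-aware placement policy, and the default block size in CDH of this era was 64 MB (128 MB in later versions).

```python
# Toy sketch of HDFS-style block splitting and 3x replication.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the CDH-era default block size
REPLICATION = 3                # default replication factor

def split_into_blocks(file_size):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    full, rest = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])

def place_replicas(num_blocks, nodes):
    """Assign each block to REPLICATION distinct nodes (toy round-robin policy,
    not HDFS's real rack-aware placement)."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        for b in range(num_blocks)
    }

blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file -> 4 blocks
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Because every block lives on three different nodes, losing any single node leaves at least two replicas of each block available, which is what gives HDFS both reliability and availability.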
HDFS Basic Concepts HDFS is a filesystem written in Java Based on Google's GFS Sits on top of a native filesystem (ext3, XFS, etc.) Provides redundant storage for massive amounts of data Using cheap, unreliable computers
HDFS Basic Concepts (cont'd) HDFS performs best with a modest number of large files Millions, rather than billions, of files Each file typically 100 MB or more Files in HDFS are write once No random writes to files are allowed HDFS is optimized for large, streaming reads of files Rather than random reads
Getting Data in and out of HDFS Hadoop API (hadoop fs) to work with data in HDFS Ecosystem Projects Flume Collects data from log-generating sources (e.g., websites, syslogs, STDOUT) Sqoop Extracts and/or inserts data between HDFS and RDBMS Business Intelligence Tools
Hadoop Components: MapReduce MapReduce is the system used to process data in the Hadoop cluster Consists of two phases: Map, and then Reduce Each Map task operates on a discrete portion of the overall dataset Typically one HDFS data block After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase Much more on this later!
Features of MapReduce Automatic parallelization and distribution Fault-tolerance Status and monitoring tools A clean abstraction for programmers MapReduce programs are usually written in Java MapReduce abstracts all the housekeeping away from the developer Developer can concentrate simply on writing the Map and Reduce functions
How MapReduce Works Word Count Example:
Mapper Input: "The cat sat on the mat" / "The aardvark sat on the sofa"
Mapping: (the, 1) (cat, 1) (sat, 1) (on, 1) (the, 1) (mat, 1) / (the, 1) (aardvark, 1) (sat, 1) (on, 1) (the, 1) (sofa, 1)
Shuffling: aardvark [1] cat [1] mat [1] on [1, 1] sat [1, 1] sofa [1] the [1, 1, 1, 1]
Reducing: (aardvark, 1) (cat, 1) (mat, 1) (on, 2) (sat, 2) (sofa, 1) (the, 4)
Final Result: aardvark, 1 cat, 1 mat, 1 on, 2 sat, 2 sofa, 1 the, 4
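The three phases of the word-count example can be sketched as a minimal in-process Python simulation. Real MapReduce jobs are usually Mapper/Reducer classes in Java running across many nodes; this sketch only mirrors the data flow of the slide above.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle phase: group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return (key, sum(values))

lines = ["The cat sat on the mat", "The aardvark sat on the sofa"]
intermediate = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
```

Each mapper only ever sees its own line (in Hadoop, its own HDFS block), and each reducer only ever sees the grouped values for its keys, which is what makes both phases trivially parallelizable.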
The Hadoop Ecosystem Making Hadoop Function as Part of an Enterprise Infrastructure
Introduction The term Hadoop is taken to be the combination of HDFS and MapReduce There are numerous other projects surrounding Hadoop Typically referred to as the Hadoop Ecosystem Most are incorporated into Cloudera's Distribution Including Apache Hadoop (CDH) All use either HDFS, MapReduce, or both
Preview of CDH: 100% open source. User interface: Hue. Workflow management: Oozie. Metadata: Hive Metastore. Integration: Sqoop, Flume; file access via FUSE-DFS and REST (WebHDFS, HttpFS). Batch processing: Hive, Pig; batch compute: MapReduce, MapReduce2; libraries: Mahout, DataFu. Resource management & coordination: YARN, ZooKeeper. Real-time access & compute: Impala, HBase, Search; SQL access via ODBC/JDBC. Storage: HDFS (Hadoop DFS), HBase. Cloud: Whirr.
Data Lifecycle: Process, Store, Explore, Analyze, Serve. Store: HDFS, Sqoop, Flume. Explore: Impala, Hive, Pig. Process/Analyze: MapReduce, Impala, Hive, Pig, Mahout, HBase. Serve: business analysts, business users, customers.
Beyond Batch: Real-Time Query for Hadoop Cloudera Impala BEFORE IMPALA: user interface, batch processing WITH IMPALA: real-time access Speed to Insight: Get answers as fast as you can ask questions Interactive analytics directly on source data No jumping between data silos Cost Savings: Reduce duplicate storage with the EDW Reduce data movement for analysis Leverage existing tools and employee skills Full Fidelity Analysis: Ask questions of all your data No loss of fidelity from aggregation or conforming to fixed schemas Discoverability: Single metadata store from source to analysis Supports familiar SQL language and existing BI tools Enables more users to interact with data
Use-Cases Part 4
Ask Bigger Questions: How do we prevent mobile device returns? A leading manufacturer of mobile devices gleans new insights & delivers instant software bug fixes.
Cloudera complements the data warehouse The Challenge: Fast-growing Oracle DW is difficult & expensive to maintain performant at scale Need to ingest massive volumes of unstructured data very quickly Mobile technology leader identified a hidden software bug causing a sudden spike in returns. The Solution: Cloudera Enterprise + RTD: data processing, storage & analysis on 25 years of data Integrated with Oracle: closed-loop analytical process Collecting device data every minute, loading 1 TB/day into Cloudera Read the case study: http://www.cloudera.com/content/cloudera/en/resources/library/casestudy/drivinginnovation-in-mobile-devices-with-cloudera-and-oracle.html
Ask Bigger Questions: How do we feed the world? A Fortune 500 company specializing in agriculture and genomics can automate data-driven R&D decisions to reduce time to market from years to months.
Fortune 500 agriculture company SITUATION SOLUTION RESULTS OPPORTUNITY More than 1,000 research scientists building product development algorithms Time to market for new products is 5-10 years BARRIERS Algorithms built in silos Data processing bottleneck slows development R&D data pipeline for each product involves a series of questions & decisions
Fortune 500 agriculture company SITUATION SOLUTION RESULTS CLOUDERA ENTERPRISE CORE + RTD, RTQ PB-scale platform for consolidated view of all R&D data Integration with Oracle Exadata, Lucene Solr, spatial awareness, visualization Hadoop components: Avro, HDFS, HBase, Hive, Hue, MapReduce, Oozie, Pig, Sqoop
Fortune 500 agriculture company SITUATION SOLUTION RESULTS BENEFITS PB-scale Increased usability Scientists directly access Hadoop Flexibility Consolidated view of all data within R&D MEASURED IMPACT Data-driven decisions in R&D pipeline automated; reduces time to market of new products Which traits do we want to integrate into this germ plasm? Which male & female plants should be brought together to create a child plant? Where should the child plant be tested?