How Apache Hadoop Complements Existing BI Systems
Dr. Amr Awadallah, Founder & CTO, Cloudera, Inc.
Twitter: @awadallah, @cloudera
The Problems with Current Data Systems
(Diagram: Instrumentation -> Collection -> Storage-Only Grid (original raw data, mostly append) -> ETL Compute Grid -> RDBMS (aggregated data) -> BI Reports + Interactive Apps)
1. Can't explore the original high-fidelity raw data.
2. Moving data to compute doesn't scale.
3. Archiving = premature data death.
The Solution: A Combined Storage/Compute Layer
(Diagram: Instrumentation -> Collection -> Hadoop: Storage + Compute Grid (mostly append) -> RDBMS (aggregated data) -> BI Reports + Interactive Apps)
1. Data exploration & advanced analytics.
2. Scalable throughput for ETL & aggregation.
3. Keep data alive forever.
So What is Apache Hadoop?
A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license).
Core Hadoop has two main systems:
- Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
- MapReduce: distributed fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.
Key business values:
- Flexibility: store any data, run any analysis.
- Scalability: start at 1TB on 3 nodes, grow to petabytes on thousands of nodes.
- Economics: cost per TB at a fraction of traditional options.
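The MapReduce programming abstraction mentioned above can be illustrated with a toy, in-memory Python sketch (not actual Hadoop code): a user supplies a mapper and a reducer, and the framework handles applying them and grouping intermediate keys. The word-count example is the canonical demonstration.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user-defined mapper to every input record."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate (key, value) pairs by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user-defined reducer to each key's grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the canonical MapReduce example.
def wc_mapper(line):
    for word in line.split():
        yield (word, 1)

def wc_reducer(word, counts):
    return sum(counts)

lines = ["big data", "big cluster"]
result = reduce_phase(shuffle(map_phase(lines, wc_mapper)), wc_reducer)
print(result)  # {'big': 2, 'data': 1, 'cluster': 1}
```

In real Hadoop, the shuffle happens across the network between map and reduce tasks on many machines; the sketch only captures the shape of the programming model.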
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
- Schema must be created before any data can be loaded.
- An explicit load operation has to take place which transforms the data to the database's internal structure.
- New columns must be added explicitly before data for those columns can be loaded.
- Pros: reads are fast; standards and governance.
Schema-on-Read (Hadoop):
- Data is simply copied to the file store; no transformation is needed.
- A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).
- New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
- Pros: loads are fast; flexibility and agility.
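The schema-on-read idea can be sketched in a few lines of Python (the record layout and field names are hypothetical, for illustration only): raw lines are stored untouched, and a deserializer applied at read time decides which columns exist, so updating the SerDe makes new fields appear retroactively.

```python
# Raw log lines are stored as-is (schema-on-read): no load-time transform.
raw_store = [
    "2012-03-01 alice login",
    "2012-03-02 bob purchase 19.99",  # a newer record with an extra field
]

def serde_v1(line):
    """Initial SerDe: knows only about the date and user columns."""
    parts = line.split()
    return {"date": parts[0], "user": parts[1]}

def serde_v2(line):
    """Updated SerDe: also extracts action and an optional amount."""
    parts = line.split()
    row = {"date": parts[0], "user": parts[1], "action": parts[2]}
    if len(parts) > 3:
        row["amount"] = float(parts[3])
    return row

def read_table(store, serde):
    """Late binding: the schema is applied at read time, not load time."""
    return [serde(line) for line in store]

# The same stored bytes yield richer rows once the SerDe is updated.
v1_rows = read_table(raw_store, serde_v1)
v2_rows = read_table(raw_store, serde_v2)
```

Note that the stored data never changes; only the read-time interpretation does, which is exactly the "flexibility/agility" advantage the slide contrasts with schema-on-write.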
Flexibility: Complex Data Processing
1. Java MapReduce: most flexibility and performance, but a tedious development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): lets you develop in any programming language of your choice, at slightly lower performance and flexibility than native Java MapReduce.
3. Crunch: a library for multi-stage MapReduce pipelines in Java (modeled after Google's FlumeJava; similar to Cascading).
4. Pig Latin: a high-level language out of Yahoo, suitable for batch data-flow workloads.
5. Hive: a SQL interpreter out of Facebook; also includes a metastore mapping files to their schemas and associated SerDes.
6. Oozie: an XML-based (PDL) workflow engine for composing jobs built from any of the above.
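With Streaming MapReduce, the mapper and reducer are ordinary scripts that read lines on stdin and write tab-separated key/value lines on stdout. The sketch below mirrors that contract in Python but operates on plain iterables so it is runnable anywhere; with actual Hadoop Streaming you would pass equivalent scripts via the streaming jar's `-mapper` and `-reducer` options.

```python
def streaming_mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(sorted_lines):
    """Sum counts for runs of identical keys (streaming input arrives sorted)."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

# Hadoop's shuffle sorts mapper output by key before the reducer sees it;
# sorted() stands in for that here.
mapped = sorted(streaming_mapper(["big data", "big cluster"]))
output = list(streaming_reducer(mapped))
print(output)  # ['big\t2', 'cluster\t1', 'data\t1']
```

The reducer's run-length structure (accumulate while the key repeats, emit on key change) is the standard idiom for streaming reducers, since each reducer receives its keys in sorted order.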
Scalability: Scalable Software Development
Grows without requiring developers to re-architect their data or applications.
(Diagram: AUTO SCALE)
Economics: Return on Byte
Return on Byte (ROB) = the value extracted from a byte divided by the cost of storing that byte.
If ROB < 1, the data gets buried in the tape wasteland; hence the need for more economical active storage.
(Diagram: high-ROB vs. low-ROB data)
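The ROB arithmetic is simple enough to show directly. The dollar figures below are made up purely for illustration; the point is only that lowering the storage cost per byte can flip a dataset from below the ROB = 1 threshold to above it.

```python
def return_on_byte(extractable_value, storage_cost):
    """ROB = value extracted from a byte / cost of storing that byte."""
    return extractable_value / storage_cost

# Hypothetical numbers: data worth $0.02/GB of extractable value, compared
# against two storage price points (illustrative, not real quotes).
rob_expensive = return_on_byte(0.02, 0.05)   # pricey storage -> ROB < 1
rob_commodity = return_on_byte(0.02, 0.004)  # commodity storage -> ROB > 1

print(rob_expensive, rob_commodity)  # 0.4 5.0
```

The same byte is "worth keeping alive" in one system and "tape fodder" in the other; only the denominator changed.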
Economics: Keep All Data Alive Forever
Archive to tape and never see it again, or extract value from all your data.
Use the Right Tool for the Right Job
Relational databases, use when you need:
- Interactive OLAP analytics
- Multistep ACID transactions
- 100% SQL compliance
Hadoop, use when you need:
- Flexibility (structured data or not)
- Scalability of storage/compute
- Complex data processing
CDH: Cloudera's Distribution Including Apache Hadoop
- File system mount: FUSE-NFS
- Web console: HUE
- Data mining: Apache Mahout
- BI connectivity: ODBC/JDBC
- Job workflow: Apache Oozie
- Metadata management: Apache Hive
- Data integration: Apache Flume, Apache Sqoop
- Analytical languages: Apache Pig, Apache Hive
- Data serving: Apache HBase
- Coordination: Apache ZooKeeper
- Cloud deployment: Apache Whirr
- Build/test: Apache Bigtop
- Cloudera Manager Free Edition (installation wizard)
HBase versus HDFS
HDFS:
- Optimized for: large files, sequential access (high throughput), append only.
- Use for: fact tables that are mostly append-only and require sequential full-table scans.
HBase:
- Optimized for: small records, random access (low latency), atomic record updates.
- Use for: dimension tables that are updated frequently and require random low-latency lookups.
- Not suitable for: low-latency interactive OLAP.
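The two access patterns above can be caricatured in plain Python data structures (an analogy, not Hadoop code): fact data as an append-only log that is read with full sequential scans, dimension data as a keyed store supporting random lookups and in-place updates.

```python
fact_log = []        # append-only, scanned front to back (HDFS-style)
dim_customers = {}   # random access by row key, per-record updates (HBase-style)

def append_fact(event):
    """HDFS files support append, not in-place update."""
    fact_log.append(event)

def upsert_dimension(key, attrs):
    """HBase supports atomic updates of a single record by row key."""
    dim_customers[key] = attrs

append_fact({"user": "alice", "amount": 10})
append_fact({"user": "alice", "amount": 5})
upsert_dimension("alice", {"segment": "bronze"})
upsert_dimension("alice", {"segment": "gold"})  # updated in place

# Sequential full scan over facts (the HDFS sweet spot):
total = sum(e["amount"] for e in fact_log)
# Low-latency point lookup on a dimension (the HBase sweet spot):
segment = dim_customers["alice"]["segment"]
```

Trying to do the opposite in each store (random updates on the log, full scans on the keyed store) is exactly the mismatch the slide warns against.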
Hadoop in the Enterprise Data Stack
(Diagram: engineers, data scientists, analysts, business users, data architects, and system operators work through IDEs, modeling tools, BI/analytics, enterprise reporting, metadata/ETL tools, and Cloudera Manager. Hadoop exchanges data with the enterprise data warehouse and relational databases via Sqoop and ODBC/JDBC/NFS, ingests logs, files, and web data via Flume, parses complex formats with HParser, and feeds online serving systems behind the web/mobile applications that face customers.)
Informatica HParser Environment
Demo: http://www.informatica.com/videos/demos/1846_hparser_simpleedi/default.htm
ODBC Connector for MicroStrategy/Tableau
(Diagram: BI tools with in-memory I-Cubes connect via multi-source ODBC to both the enterprise data warehouse and, through the Cloudera ODBC driver and Hive SQL, to Hadoop MapReduce over HDFS and HBase.)
Three Deployment Options
Option 1: The complete solution (Cloudera Enterprise = Cloudera Manager + support, plus CDH)
- Fastest rollout; easiest to deploy, manage, and configure.
- Fully supported; annual subscription.
Option 2: The no-cost trial (Cloudera Manager Free Edition + CDH)
- No software cost; no support included.
- Manage up to 50 nodes; limited functionality in Cloudera Manager.
Option 3: CDH free download (CDH only)
- Self-install and configure; no software cost; no support included.
- More management and installation complexity; longer rollout for most customers.
All three options build on the 100% open-source CDH (Cloudera's Distribution Including Apache Hadoop); Cloudera Enterprise adds proprietary software and support on top.
Conclusion
The key benefits of Apache Hadoop:
- Agility (dynamic run-time schemas, any language).
- Scalability of storage/compute (freedom to grow).
- Economics (keep all your data alive forever).
Cloudera has the most complete technology offering, the most experienced team, the most diversified track record, and the richest partner ecosystem to enable your success with Apache Hadoop.
Appendix: Backup Slides
Hadoop Creation History
- Fastest sort of a TB: 62 seconds over 1,460 nodes.
- Sorted a PB in 16.25 hours over 3,658 nodes.
HDFS: Hadoop Distributed File System
A given file is broken down into blocks (default = 64MB), then blocks are replicated across the cluster (default = 3 replicas).
Optimized for: throughput, put/get/delete, and appends.
Block replication provides: durability, availability, and throughput.
Block replicas are distributed across servers and racks.
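The block arithmetic and the cross-rack spreading can be sketched as follows. The placement function is a deliberately simplified version of the default policy (first replica near the writer, the other two together on a different rack); real HDFS placement also weighs node load and available space.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # HDFS default block size at the time (64MB)
REPLICATION = 3                 # HDFS default replication factor

def split_into_blocks(file_size):
    """Number of blocks a file occupies; the last block may be partial."""
    return max(1, math.ceil(file_size / BLOCK_SIZE))

def place_replicas(local_rack, remote_rack):
    """Simplified default placement: one replica on the writer's rack,
    the remaining two together on a different rack (survives a rack loss)."""
    return [local_rack, remote_rack, remote_rack][:REPLICATION]

one_gb = 1024 * 1024 * 1024
blocks = split_into_blocks(one_gb)                  # 1GB / 64MB = 16 blocks
replicas = place_replicas("rack1", "rack2")

print(blocks, replicas)  # 16 ['rack1', 'rack2', 'rack2']
```

Spreading replicas across racks is what buys durability and availability; keeping two of them on one remote rack limits cross-rack write traffic while still tolerating the loss of either rack.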
MapReduce: Resource Manager / Scheduler
A given job is broken down into tasks, then tasks are scheduled as close to the data as possible.
Three levels of data locality:
- Same server as the data (local disk).
- Same rack as the data (rack/leaf switch).
- Wherever there is a free slot (cross-rack).
Optimized for batch processing and failure recovery: the system detects laggard tasks and speculatively executes parallel tasks on the same slice of data.
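The three locality levels above amount to a simple preference order when matching a task's input block to a free slot. A minimal sketch (node and rack names are made up for illustration):

```python
def locality_level(block_locations, free_slot, rack_of):
    """Classify how close a free slot is to a task's input data:
    node-local beats rack-local beats off-rack (cross-rack)."""
    if free_slot in block_locations:
        return "node-local"           # same server as the data (local disk)
    if rack_of[free_slot] in {rack_of[n] for n in block_locations}:
        return "rack-local"           # same rack as the data (leaf switch)
    return "off-rack"                 # wherever there is a free slot

# Illustrative topology: two racks, the input block's replica lives on n1.
rack_of = {"n1": "rack1", "n2": "rack1", "n3": "rack2"}
block_on = ["n1"]

levels = [locality_level(block_on, slot, rack_of) for slot in ("n1", "n2", "n3")]
print(levels)  # ['node-local', 'rack-local', 'off-rack']
```

The real scheduler evaluates this preference across all of a block's replicas, which is one reason replication improves throughput as well as durability: more replicas mean more chances of a node-local or rack-local slot.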
MapReduce Next Gen
The main idea is to split up the JobTracker's functions:
- Cluster resource management (tracking and allocating nodes).
- Application life-cycle management (MapReduce scheduling and execution).
This enables: high availability, better scalability, efficient slot allocation, rolling upgrades, and non-MapReduce applications.
Books