Copyright 2015 EMC Corporation. All rights reserved. A long time ago

Christina Hampton
5 years ago
2 A long time ago

3 MAP REDUCE HDFS

4 IN A BLINK OF AN EYE
Ecosystem components shown: Crunch, Mahout, YARN, MLlib, PivotalR, Hue (Hadoop UI), Zookeeper (coordination and workflow management), Pig, Hive, MapReduce, Tez, Giraph, Phoenix, SolrCloud, Flink, HBase, Shark, GraphX, Streaming, Spark, Tachyon, HDFS, HAWQ, MADlib, Oozie, Sqoop, SpringXD, Flume
Legend: ASF Projects, FLOSS Projects, Pivotal Products

5 OSS Market Trend Source: DB-Engines.com

6 Is there a catch? Source: DB-Engines.com

7 PIVOTAL BIG DATA SUITE ALEXANDER ERMAKOV - PIVOTAL

8 EMC Federation

9 Big Data Suite 2015: Agile Data Stack Advanced Analytics Apps at Scale Pivotal Greenplum Database Pivotal HAWQ Redis Pivotal GemFire Rabbit MQ Pivotal Labs & Data Science Labs Pivotal Cloud Foundry Data Processing Spring XD Spark Pivotal HD & Open Data Platform Commodity Hardware Appliance Hybrid Cloud Pivotal Cloud Foundry

10 Data Access Lookup Query Analytics GemFire HBase Greenplum DB MapReduce Pig Hive Real Time Interactive Batch HAWQ

11 Data Ingestion Streaming Micro batch Spring XD Event collection N/A Flume GPLoad Sqoop Mega batch Event processing

12 Big Data Suite

14 MPP Shared Nothing Architecture
- Flexible framework for processing large datasets
- Master Host and Standby Master Host
- Master coordinates work with Segment Hosts
- Segment Host with one or more Segment Instances
- Segment Instances process queries in parallel
- Segment Hosts have their own CPU, disk and memory (shared nothing)
- High speed interconnect for continuous pipelining of data processing
[Diagram: SQL arrives at the Master Host (with a Standby Master); the Interconnect links it to Segment Hosts node1 through nodeN, each running several Segment Instances]
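
The division of labour on this slide (the master coordinates, segment instances work in parallel on their own data) can be illustrated with a minimal Python sketch. It is a toy model and not Greenplum code: the `SEGMENT_DATA` slices and the function names are hypothetical, and threads merely stand in for separate segment hosts.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the slice of one table held by a
# single segment instance. No slice is visible to any other segment.
SEGMENT_DATA = [
    [3, 8, 1],
    [7, 2],
    [5, 5, 4],
    [9],
]


def segment_partial(rows):
    """Work done locally on one segment: aggregate only its own rows."""
    return sum(rows), len(rows)


def master_average(segment_data):
    """Master role: dispatch the same task to every segment, then gather."""
    with ThreadPoolExecutor(max_workers=len(segment_data)) as pool:
        partials = list(pool.map(segment_partial, segment_data))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


print(master_average(SEGMENT_DATA))
```

Only the small partial results travel back to the master, which is the point of pushing the aggregation down to the segments.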

15 ARCHITECTURE: NO-FORKLIFT SCALABILITY
Advantages:
- Scale existing systems
- No forklifting
- Immediate capacity increase
Simple process:
- Connect new hardware
- Simple restart
- Schedule redistribution of existing data
[Diagram: New Segment Servers; Query planning & dispatch]

16 PERFORMANCE: PARALLEL QUERY OPTIMIZER
- Cost-based optimization looks for the most efficient plan
- Physical plan contains scans, joins, sorts, aggregations, etc.
- Global planning avoids suboptimal SQL pushing to segments
- Directly inserts motion nodes for inter-segment communication
[Diagram: PHYSICAL EXECUTION PLAN FROM SQL OR MAPREDUCE. Operators shown: Gather Motion 4:1 (Slice 3), Sort, HashAggregate, HashJoin, Hash, Redistribute Motion 4:4 (Slice 1), Broadcast Motion 4:4 (Slice 2), Seq Scan on line item, orders, customer, motion]

17 LOADING: MASSIVELY-PARALLEL INGEST
Extreme speed and immediate usability from files, ETL & Hadoop
Fast Parallel Load & Unload:
- No Master Node bottleneck
- 10+ TB/Hour per Rack
- Linear scalability
Low Latency:
- Data immediately available
- No intermediate stores
- No data reorganization
Load/Unload To & From:
- File Systems
- ETL Products
- Hadoop Distributions
[Diagram: External Sources (loading, streaming, etc.: ETL, SQL, File Systems) connect to Segment Servers (query processing & data storage) over the gNet Network Interconnect; Master Servers handle query planning & dispatch]

18 STORAGE: POLYMORPHIC TABLE STORAGE
- Column-oriented for COLD DATA, row-oriented for HOT DATA
- Provide the choice of processing model for any table or any individual partition
- Enable Information Lifecycle Management (ILM)
- Storage types can be mixed within a table or database
- Four table types: heap, row-oriented AO, column-oriented, external
- Block compression: Gzip (levels 1-9), QuickLZ
- Columnar compression: RLE
[Diagram: TABLE CUSTOMER partitioned by month, Mar 11 through Nov 11]
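
To see why run-length encoding (RLE) pairs naturally with column-oriented storage, here is a minimal Python sketch of the general technique. It is not Greenplum's on-disk format; the sample rows and the function names are made up for illustration.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, length in pairs:
        out.extend([value] * length)
    return out


# Hypothetical rows (customer_id, country). In a row layout the repeated
# country values are interleaved with ids, so there are no long runs.
rows = [(1, "DE"), (2, "DE"), (3, "DE"), (4, "FR"), (5, "FR"), (6, "US")]

# A column layout stores each column contiguously, which exposes the runs.
country_column = [country for _, country in rows]
encoded = rle_encode(country_column)

print(encoded)
```

Decoding the pairs restores the column exactly, so the compression is lossless; the gain depends on how long the runs are, which is why it suits sorted or low-cardinality columns.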

19 HIGH AVAILABILITY (MASTER SERVER)
Master Server Data Protection:
- Replicated transaction logs for server failure
- Optional RAID protection for drive failures
Upon server failure:
- Standby server activated
- Administrator alerted
- Orchestrated failover
Segment Server Data Protection:
- Mirrored segments for server failures
- Optional RAID protection for drive failures
Upon server failure:
- Mirrored segments take over with no loss of service
- Fast online differential recovery
[Diagram: two Master servers above four Segment servers]

20 HIGH AVAILABILITY (SEGMENT SERVER)
Master Server Data Protection:
- Replicated transaction logs for server failure
- Optional RAID protection for drive failures
Upon server failure:
- Standby server activated
- Administrator alerted
- Orchestrated failover
Segment Server Data Protection:
- Mirrored segments for server failures
- Optional RAID protection for drive failures
Upon server failure:
- Mirrored segments take over with no loss of service
- Fast online differential recovery
[Diagram: Active Blocks, primaries (P) and mirrors (M) per node]
Node 1: P1 P2 P3 M6 M8 M10
Node 2: P4 P5 P6 M1 M9 M11
Node 3: P7 P8 P9 M2 M4 M12
Node 4: P10 P11 P12 M3 M5 M7
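
The primary (P) and mirror (M) placement listed on this slide spreads each node's mirrors over the other nodes, so a single node failure adds only a little load to each survivor. The Python sketch below reproduces that layout with one simple rule: the mirror of the j-th primary on node i goes to node (i + 1 + j) mod N. The rule is inferred from the figure and is an assumption, not a documented Greenplum algorithm; the function name is hypothetical.

```python
def spread_mirror_layout(num_nodes, primaries_per_node):
    """Place primaries in order and spread each node's mirrors over other nodes.

    Assumed rule (inferred from the slide figure): the mirror of the j-th
    primary on node i is placed on node (i + 1 + j) mod num_nodes.
    """
    if primaries_per_node >= num_nodes:
        # Otherwise some mirror would land on the same node as its primary.
        raise ValueError("need fewer primaries per node than nodes")

    layout = {n: {"primaries": [], "mirrors": []} for n in range(1, num_nodes + 1)}
    segment_id = 1
    for node_index in range(num_nodes):
        for j in range(primaries_per_node):
            mirror_index = (node_index + 1 + j) % num_nodes
            layout[node_index + 1]["primaries"].append(f"P{segment_id}")
            layout[mirror_index + 1]["mirrors"].append(f"M{segment_id}")
            segment_id += 1
    return layout


layout = spread_mirror_layout(num_nodes=4, primaries_per_node=3)
for node, segments in layout.items():
    print(f"Node {node}:", *segments["primaries"], *segments["mirrors"])
```

With four nodes and three primaries each, losing one node moves exactly one extra active segment to each of the three remaining nodes.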

21 EXTENSIBLE FOR ANALYTICS: IN-DATABASE ANALYTICAL ALGORITHMS
Bringing the power of parallelism to commonly-used modeling and analytics functions
- MADlib in-database analytics: an open-source library of advanced analytics functions
- SAS HPA, Access, and Scoring Accelerator
- Analytics extensions supported, including PostGIS (geospatial support), PL/R (statistical computing), PL/Java, PL/Perl, etc.

22 SIMPLE TO MANAGE
Greenplum Command Center:
- Complete platform management and control
Greenplum Package Manager:
- Automates install, uninstall, update, and query for analytics extensions
- Supports package migration during upgrade, segment recovery, expansion, and standby initialization

24 Pivotal HD
- 100% Apache Hadoop-based platform
- Virtualization and cloud ready with VMware and Isilon
- Scale tested in 1000 node Pivotal Analytics Workbench
- Available as a software-only or appliance-based solution
- Backed by EMC's global, 24x7 support infrastructure
- Collaboration with Apache Software Foundation (ASF) and Hortonworks (ODP)

25 Standardize Hadoop Ecosystem: Open Data Platform
- Focused on developing common core to enable Hadoop ecosystem
- Focused on core components of Hadoop: HDFS, MapReduce, YARN and Ambari
- Rapidly accelerated certifications, ecosystem development, predictability and enterprise applicability

26 PIVOTAL HD 3.0 ARCHITECTURE HAWQ 1.3 Advanced Database Services Pivotal HD Enterprise Resource Management & Workflow Yarn Zookeeper HBase Xtension Framework ANSI SQL + Analytics Catalog Services Dynamic Pipelining HDFS Query Optimizer Pig, Hive, Mahout Map Reduce Deploy, Configure, Monitor, Manage AMBARI Oozie Sqoop SpringXD Flume

27 HAWQ: The Crown Jewels
- SQL compliant
- World-class query optimizer
- Interactive query
- Horizontal scalability
- Robust data management
- Common Hadoop formats
- Deep analytics

28 HAWQ Simply
Greenplum database replatformed on Hadoop/HDFS
Multi-User Platform: Resource Queues, Concurrency, Data Encryption, Role-Based Security
SQL Engine: ANSI SQL 2003/2011 Support, Cost-Based Query Optimization, Robust Query Optimizer
Complex Data Management: Sub-Partitioning, Distributions, Partitioning
Accessibility: ODBC/JDBC Driver L3,4, Parallel Loading/Unloading, HDFS Native Formats, Extendable (txt, Avro, Seq, HBase, Hive, Parquet)
Storage Options: Polymorphic Storage, Row/Columnar Storage, Built-in Compression, HDFS Native Formats, MapReduce Integration
[Diagram: CPU, Mem, Disk, Users]

29 Loading/Unloading Data
Highly Parallel methods to integrate with HAWQ
gpload, gpfdist, External Tables:
- Flat Files, CSV, Delimited, Existing RDBMS Systems
- Web Tables, JSON, XML, HTML, Executing Scripts, DataLoader
- File Farms, Streaming, Batch Mode
- Flume, integration
PXF {Native Hadoop Files}:
- HDFS Flat Files, CSV, Delimited
- Hive, HBase {w. predicate push-down}
- Avro, RCFile, SeqFile, Parquet
- Open extendable API
- Available on Github: Accumulo, JSON
Spring XD Java Development Framework
Traditional Tools:
- Postgres insert, copy, ODBC + JDBC drivers
Pivotal Data Dispatch {PDD}:
- Integration with ETL tools
- Throttling, Compression, features

30 HAWQ Storage Options
Tables in HAWQ can be:
- Distributed
- Partitioned by range or list
- Row or columnar oriented
- Compressed with zlib, quicklz, RLE
- Polymorphic storage
[Diagram: TABLE A distributed across segments SEG-1 through SEG-N; each segment holds partitions (PART A) and sub-partitions, stored as ROW or COLUMNAR (polymorphic storage) and compressed]

31 Data Distribution
- Data can be distributed based on a column or a composite of columns
- Tables distributed similarly are co-located
- Distribution scheme modifiable through ALTER TABLE
Advantages:
- Co-located joins
- No data movement on joins or aggregates
- Improved performance on complex queries
- Query engine optimization
[Diagram: Table A (X=1 through X=5) and Table B (Y=1 through Y=3) distributed across DN1, DN2 and DN3]
Example queries:
SELECT X FROM A,B WHERE A.X = B.Y
SELECT SUM(X) FROM A GROUP BY A.X
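
The co-location idea on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions: `segment_for` uses a simple modulo as a stand-in for the real hash function, three segments play the role of DN1 to DN3, and all names are hypothetical. It mirrors the slide's first query, joining A.X to B.Y.

```python
NUM_SEGMENTS = 3  # stands in for DN1, DN2, DN3 on the slide


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Assumed placement rule: a modulo in place of the real hash function."""
    return key % num_segments


def distribute(rows, key_of, num_segments=NUM_SEGMENTS):
    """Send each row to the segment chosen by its distribution key."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(key_of(row), num_segments)].append(row)
    return segments


def colocated_join(a_segments, b_segments):
    """Join A.x = B.y segment by segment; no row ever leaves its segment."""
    matches = []
    for a_rows, b_rows in zip(a_segments, b_segments):
        local_b_keys = {row["y"] for row in b_rows}
        matches.extend(row["x"] for row in a_rows if row["x"] in local_b_keys)
    return sorted(matches)


table_a = [{"x": x} for x in (1, 2, 3, 4, 5)]
table_b = [{"y": y} for y in (1, 2, 3)]

# Both tables are distributed on their join column with the same rule, so
# equal keys always land on the same segment.
a_segments = distribute(table_a, lambda row: row["x"])
b_segments = distribute(table_b, lambda row: row["y"])

print(colocated_join(a_segments, b_segments))
```

If table B were distributed on some other column, matching keys could sit on different segments and the join would first need rows moved between segments, which is the data movement the slide says co-location avoids.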

32 HAWQ Distribution vs. Hive Partitioning
- In Hive, partitions are organized into folders
- Folders are spread across the entire HDFS
- Similar data are not co-located; data location is lost
- Data movement is required for large joins and aggregates
- Hive partitions help in sequential scan of the original table only
[Diagram: DATA IS SPREAD ON HDFS. Table A (X=1 through X=5, in FOLDER a, FOLDER b, FOLDER c) and Table B (Y=1 through Y=3, in FOLDER aa, FOLDER bb) are scattered across DN1, DN2 and DN3: NO CO-LOCATED JOINS, NO CO-LOCATED AGGREGATES]

33 Basic HAWQ Architecture
[Diagram]
- HAWQ Master (with HAWQ Standby Master): Parser, Query Optimizer, Dispatch, Query Executor, Local TM, PXF, Local Storage
- NameNode and Secondary NameNode (HDFS)
- Interconnect
- Segment Hosts, each on an HDFS DataNode: Query Executor, PXF, one or more Segments, Local Temp Storage

35 Pivotal GemFire
Pivotal GemFire is the distributed, NoSQL, in-memory database (IMDG):
1. Scale-out performance
2. Consistent database operations across globally distributed nodes
3. High availability, resilience, and global scale
4. Powerful developer features
5. Easy administration of distributed nodes

36 GEMFIRE OVERVIEW

37 GemFire Client
- A client can be a publisher or a subscriber or both
- Clients have access to all the data in the DS
- Data is usually a single hop away
- Clients can cache data locally
- Clients can register interest in specific items
  - Keys
  - List of keys
  - Regular Expressions
- Continuous Queries
- Qualifying updates are sent to each client
- Client side machinery gets invoked in
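
The interest-registration behaviour described on this slide can be modelled with a small Python sketch. It is a toy publish/subscribe model, not the GemFire client API: the `Region` and `Client` classes and their method names are hypothetical and only illustrate how key lists and regular expressions select which updates reach which client.

```python
import re


class Client:
    """A subscriber that records every update pushed to it."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def on_update(self, key, value):
        self.received.append((key, value))


class Region:
    """Toy data region that forwards qualifying updates to interested clients."""

    def __init__(self):
        self.data = {}
        self.interests = []  # list of (key predicate, client) pairs

    def register_keys(self, client, keys):
        """Register interest in one key or a list of keys."""
        wanted = set(keys)
        self.interests.append((lambda key: key in wanted, client))

    def register_regex(self, client, pattern):
        """Register interest in every key that fully matches the pattern."""
        compiled = re.compile(pattern)
        self.interests.append(
            (lambda key: compiled.fullmatch(key) is not None, client)
        )

    def put(self, key, value):
        """Publish an update and push it to each client whose interest matches."""
        self.data[key] = value
        for matches, client in self.interests:
            if matches(key):
                client.on_update(key, value)


region = Region()
alice, bob = Client("alice"), Client("bob")
region.register_keys(alice, ["order:1"])
region.register_regex(bob, r"order:\d+")

region.put("order:1", "new")
region.put("order:2", "new")
region.put("customer:9", "gold")

print(alice.received)
print(bob.received)
```

Here any caller of `put` acts as a publisher, and a client that both calls `put` and registers interest plays both roles, as the first bullet on the slide allows.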

38 DEPLOYMENT FLEXIBILITY FOR IN-MEMORY APPS
- Embedded: Flexibility
- Embedded, Clustered: Flexibility, Scale
- Tiered, Clustered: Flexibility, Scale, Performance
- Distributed, Clustered: Flexibility, Scale, Performance, Availability
- Geo-distributed: Flexibility, Scale, Performance, Availability, Localization
[Diagram: WEB SERVER nodes with GEM CACHE, GEM PEER, GEM CLIENT and GEM SERVER components]
