Cloud + Big Data: Putting It All Together

Big, Fast and Flexible Data Big Big Data Processing Fast OLTP workloads Flexible Document Object Big Data Analytics Analytic workloads Key / Value OSS Relational Cloud Delivery Model Data as a service for private and public clouds

Big, Fast and Flexible Data Big Big Data Processing Fast OLTP workloads Flexible Document Serengeti Object GemFire Big Data Analytics Analytic workloads Key / Value OSS Relational vPostgres Cloud Delivery Model Data as a service for private and public clouds

Cloud Stack Neutral View SaaS PaaS IaaS

Big Data IaaS

But first, some background: how to build an IaaS cloud

https://customer.portal.org Generate Ticket 1st Line Support Service Catalog Workflow engine SLA Descriptions Show back Billing Information Customer Portal ITSM Ticketing Change Mgmt Support Change & Release Mgmt Automated ITIL Process Including Approvals Service Renewal Service Owner Service Delivery Management Cost Models Usage Allocation Pay As You Go CB / SB Exported Billing Information Performance Mgmt Resource Mgmt Capacity Mgmt Compliance Mgmt Customer A Users Groups Service Catalog Customer B Customer C Automated Provisioning Multi Tenancy IT Service Catalog Resource Distribution Resource Allocation Users Groups Service Catalog Users Groups Service Catalog Administrative Interface / Resource Allocation and Definition Customer D Users Groups Service Catalog Central Infrastructure Management Cust A Gold Network & Security Firewall VPN Load Balancer NAT Cust B Silver Cust C Bronze Out Of The Box Integration Human Interaction Integration must be built

https://customer.portal.org Generate Ticket 1st Line Support Service Service Catalog Workflow engine SLA Manager Descriptions Show back Billing Information -- Customer Portal DynamicOps ITSM Ticketing Change Mgmt Support Service Manager Change & Release Director Mgmt Application Automated ITIL Process Including Approvals Service Manager Renewal Service Owner Service Service Delivery Manager Management / ITBM Cost Models Usage Allocation vCenter Pay As You Go CB / SB Exported Billing Information Chargeback Performance Mgmt vCenter Resource Operations Mgmt Capacity Mgmt Compliance Suite Mgmt Customer A Users Groups Service Catalog Customer B Customer C Automated Provisioning Multi Tenancy IT Service Catalog Resource Distribution Resource Allocation vCloud Director Users Groups Service Catalog Users Groups Service Catalog Administrative Interface / Resource Allocation and Definition vSphere Customer D Users Groups Service Catalog Cust A Gold Network & Security Firewall VPN vCNS Load Balancer NAT Cust B Silver Cust C Bronze Out Of The Box Integration Human Interaction Integration must be built

Organization: Finance Organization: Marketing Users & Policies Org VDC Catalogs Users & Policies Org VDC Catalogs Gold Provider Virtual Datacenters Silver Bronze VMware vCenter Server Resource Pools Datastores Port Groups VMware vSphere

Complete Cloud Suite Management Cloud Infrastructure Extensibility vFabric Application Director vCenter Operations Mgmt Suite vCenter Site Recovery Manager Software Defined Storage vCloud Director Software Defined Networking Virtualization vSphere Software Defined Security (server, storage, network) Software Defined Availability vCloud APIs vCloud Connector vCenter Orchestrator

Virtualizing Hadoop: Project Serengeti

3 Big Reasons to Virtualize Hadoop

1. Virtualize Hardware Big SQL NoSQL Hadoop SQL NoSQL Unified Big Data Infrastructure Private Public Hadoop DSS

2. Rapid Provisioning: "I want my Hadoop cluster NOW!"

3. Leverage Capabilities: increase utilization, no single points of failure, VM isolation, resource management

What? Hadoop in a VM? Really? Actually, Hadoop performs well in a virtual machine.

[Figure: Performance of Hadoop for Several Workloads. Y-axis: ratio of time taken to native (lower is better). Series: 1 VM, 2 VMs.]

Fast Provisioning: from a seed node to a cluster. Thin provisioning and linked clones: 60 GB => 3.5 GB, ~6 seconds
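To put the 60 GB => 3.5 GB figure in perspective, here is a rough sketch comparing the disk footprint of a cluster with fully allocated disks against thin-provisioned ones. It assumes the figure means a 60 GB provisioned disk that initially occupies about 3.5 GB; the function names and the 10-node cluster size are made up for the example, and real footprints grow as nodes write data.

```python
# Sketch only: the 60 GB / 3.5 GB values come from the slide, read here as
# "provisioned size" vs "space actually allocated at creation time".
PROVISIONED_GB = 60.0   # disk size the guest sees
ALLOCATED_GB = 3.5      # space consumed on the datastore at creation (assumed reading)


def thick_footprint_gb(nodes: int) -> float:
    """Datastore space if every node's disk is fully allocated up front."""
    return nodes * PROVISIONED_GB


def thin_footprint_gb(nodes: int) -> float:
    """Initial datastore space if every node's disk is thin-provisioned."""
    return nodes * ALLOCATED_GB


def initial_savings_fraction(nodes: int) -> float:
    """Fraction of datastore space saved at creation time by thin provisioning."""
    return 1.0 - thin_footprint_gb(nodes) / thick_footprint_gb(nodes)


if __name__ == "__main__":
    n = 10  # hypothetical cluster size
    print(f"thick: {thick_footprint_gb(n):.1f} GB")
    print(f"thin:  {thin_footprint_gb(n):.1f} GB")
    print(f"saved: {initial_savings_fraction(n):.1%}")
```

The saving is an initial figure only: thin disks grow toward their provisioned size as HDFS data lands on them, so datastore capacity still has to be monitored.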

Being Efficient through Resource Over-commitment
Memory over-commitment
- Hadoop JVMs hold onto memory even when not busy
- vSphere memory overcommit allows us to pack more Hadoop nodes per host
- If you use EM4J, this can be optimized further
Disk over-commitment
- Hadoop is designed for large datasets
- Thin provisioning is wonderful for saving disk footprint
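The packing effect of memory over-commitment can be sketched with simple arithmetic. Everything numeric here is a hypothetical assumption for illustration (256 GB host, 16 GB per Hadoop VM, 1.25x overcommit ratio); the slides give no specific figures, and a safe ratio depends on how much of each JVM's memory is actually idle.

```python
import math


def nodes_per_host(host_mem_gb: float, vm_mem_gb: float, overcommit_ratio: float = 1.0) -> int:
    """How many Hadoop VMs fit on one host for a given memory overcommit ratio.

    overcommit_ratio = 1.0 means no over-commitment: the sum of configured
    VM memory may not exceed physical host memory.
    """
    if vm_mem_gb <= 0 or overcommit_ratio <= 0:
        raise ValueError("vm_mem_gb and overcommit_ratio must be positive")
    return math.floor(host_mem_gb * overcommit_ratio / vm_mem_gb)


if __name__ == "__main__":
    # Hypothetical host and VM sizes, not taken from the slides.
    print(nodes_per_host(256, 16))         # no overcommit
    print(nodes_per_host(256, 16, 1.25))   # 25% overcommit
```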

Performance
Create more, smaller VMs
- Makes Hadoop scale better
- A single large Hadoop node is limited by JVM scalability
- Allows for easier/faster adjustment of packing of VMs across hosts by vSphere (through DRS)
Sizing/configuration of storage is critical
- Plan on ~50 MB/s of bandwidth per core
- SAN ports/switches will limit performance
- SANs are typically configured by default for IOPS, not bandwidth
- Performance of the backend storage should be tested/sized
- Local disks will give ~100 MB/s per disk: pick the correct controller
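The two storage rules of thumb above (~50 MB/s per core, ~100 MB/s per local disk) combine into a quick sizing check. This is a back-of-the-envelope sketch, not a substitute for testing the backend storage; the function names and the 16-core host are illustrative assumptions.

```python
import math

MB_PER_SEC_PER_CORE = 50    # planning figure from the slide
MB_PER_SEC_PER_DISK = 100   # approximate local-disk throughput from the slide


def required_bandwidth_mb_s(cores: int) -> int:
    """Aggregate storage bandwidth to plan for on a host with `cores` cores."""
    return cores * MB_PER_SEC_PER_CORE


def local_disks_needed(cores: int) -> int:
    """Minimum number of local disks to meet the planned bandwidth."""
    return math.ceil(required_bandwidth_mb_s(cores) / MB_PER_SEC_PER_DISK)


if __name__ == "__main__":
    cores = 16  # hypothetical host
    print(f"{cores} cores -> {required_bandwidth_mb_s(cores)} MB/s "
          f"-> {local_disks_needed(cores)} local disks")
```

With these figures the ratio works out to roughly one local disk per two cores, provided the disk controller can sustain the aggregate throughput.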

Summary
- Hadoop does work well in a virtual environment
- Plan a virtual cluster, enable other big-data solutions on the same infrastructure
- Leverage the recipes to automate your configuration and deployment

"The big glaring hole [with cloud] is data handling." - Adrian Kunzle, MD, Head of Engineering & Architecture, JPMorgan Chase

New Ways to Work with Data
NoSQL: In-memory
- Key/value pairs, simplicity, high productivity
- Different offerings, different data models: document, graph, big table, column
NewSQL: In-memory
- Scalability benefits of in-memory systems with standardized SQL
Classic SQL: Traditional RDBMS
- ACID (atomicity, consistency, isolation, durability)

How do you scale the data tier?

vFabric GemFire Application Data Lives Here Application Data Sleeps Here

Key Capabilities
- Low-latency, linearly-scalable, memory-based data fabric
- Data-aware execution
- Active/continuous querying and event notification

Primary Use Cases
- Web session cache, L2 cache
- App data cache, in-memory DB
- Grid data fabric: client compute
- Grid data fabric: fabric compute
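The "app data cache" use case is commonly implemented with a cache-aside pattern: look in memory first, and go to the backing store only on a miss. The sketch below illustrates that pattern with a plain Python dictionary; it is not the GemFire API, and `CacheAside`, `load_from_store` and the sample keys are hypothetical names chosen for the example.

```python
from typing import Any, Callable, Dict


class CacheAside:
    """Minimal in-memory cache-aside wrapper around a slower backing store."""

    def __init__(self, load_from_store: Callable[[str], Any]) -> None:
        self._cache: Dict[str, Any] = {}
        self._load = load_from_store   # hypothetical loader, e.g. a database read
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Any:
        """Return the cached value, loading it from the store on a miss."""
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._load(key)
        self._cache[key] = value
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key so the next read goes back to the store."""
        self._cache.pop(key, None)


if __name__ == "__main__":
    store = {"user:1": "alice"}                 # stand-in for a database
    cache = CacheAside(lambda k: store[k])
    cache.get("user:1")                         # miss: loaded from the store
    cache.get("user:1")                         # hit: served from memory
    print(cache.hits, cache.misses)
```

A distributed data fabric adds what this sketch leaves out: partitioning and replication of the cached data across nodes, and notifications when entries change.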

Existing Applications New Applications vFabric Data Director DBA App Dev Automation Self-Service Provisioning Backup / Restore Clone One click HA DBA IT Admin Policy Based Control Resource Mgmt Security Mgmt Database Templates Monitor Private Cloud Hybrid Cloud Public Cloud

Big Data PaaS: Cloud Foundry & vFabric

Cloud Stack Neutral View SaaS PaaS IaaS

Cloud Stack Classic Pyramid SaaS PaaS IaaS

Cloud Stack By Numbers SaaS PaaS IaaS

Cloud Stack By Value SaaS PaaS IaaS

Big Data PaaS Architecture Business Intelligence Applications UI Framework Data Integration Big Data API Data Process Analytics Workflow Scheduling Metadata Languages U-Data Store Coordination Other Application Services Graph Store Read / Write Access Application Lifecycle Management Security Systems Monitoring & Management Infrastructure as a Service (IaaS)

OSS community

vFabric Postgres Data Services vFabric RabbitMQ Msg Services Additional partners services Other Services

Data Services Private Clouds Msg Services Public Clouds Partners Other Services Micro Clouds.COM

VMware Cloud Application Platform Programming Model Rich Web Social and Mobile Data Access Integration Patterns Batch Framework Spring Tool Suite WaveMaker Cloud Foundry Java Runtime (tc Server) Web Runtime (ERS) Messaging (RabbitMQ) Global Data (GemFire) In-mem SQL (SQLFire) App Monitoring (Spring Insight) Performance Mgmt (Hyperic) Java Optimizations (EM4J, ) Virtual Datacenter Cloud Infrastructure and Management Automated App Provisioning (AppDirector)

Big Data SaaS: Cetas

Data Sources

On-Premise Installation

Cloud-based Installation

Summary