Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

Similar documents
Microsoft Big Data and Hadoop

VOLTDB + HP VERTICA. page

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Data Storage Infrastructure at Facebook

@Pentaho #BigDataWebSeries

April Copyright 2013 Cloudera Inc. All rights reserved.

Stages of Data Processing

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms

10 Million Smart Meter Data with Apache HBase

Evolving To The Big Data Warehouse

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

Evolution of Big Data Facebook. Architecture Summit, Shenzhen, August 2012 Ashish Thusoo

Big Data Facebook

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Cloudera Kudu Introduction

MAPR TECHNOLOGIES, INC. TECHNICAL BRIEF APRIL 2017 MAPR SNAPSHOTS

Data-Intensive Distributed Computing

Hadoop An Overview. - Socrates CCDH

How to Write Data to HDFS

microsoft

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

MapR Enterprise Hadoop

BIG DATA COURSE CONTENT

Cloud Computing & Visualization

Achieving Horizontal Scalability. Alain Houf Sales Engineer

Time Series Storage with Apache Kudu (incubating)

Best Practices and Performance Tuning on Amazon Elastic MapReduce

5 Fundamental Strategies for Building a Data-centered Data Center

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

The age of Big Data Big Data for Oracle Database Professionals

2014 年 3 月 13 日星期四. From Big Data to Big Value Infrastructure Needs and Huawei Best Practice

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

Big Data Architect.

Challenges for Data Driven Systems

Top Trends in DBMS & DW

Security and Performance advances with Oracle Big Data SQL

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

Acquiring Big Data to Realize Business Value

QLIK INTEGRATION WITH AMAZON REDSHIFT

Capture Business Opportunities from Systems of Record and Systems of Innovation

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Automating Information Lifecycle Management with

A BigData Tour HDFS, Ceph and MapReduce

TESTING BIG DATA WORLD RIGA. by Konstantin Pletenev OCTOBER, 2017, TAPOST GROW CONFIDENTLY

Warehouse- Scale Computing and the BDAS Stack

Cloud Analytics and Business Intelligence on AWS

BigInsights and Cognos Stefan Hubertus, Principal Solution Specialist Cognos Wilfried Hoge, IT Architect Big Data IBM Corporation

CloudExpo November 2017 Tomer Levi

Data Reduction Meets Reality What to Expect From Data Reduction

Apache Kudu. Zbigniew Baranowski

Oracle Big Data Connectors

10/29/2013. Program Agenda. The Database Trifecta: Simplified Management, Less Capacity, Better Performance

Big Data Infrastructure at Spotify

Hadoop Overview. Lars George Director EMEA Services

A Review Paper on Big data & Hadoop

The New World of Big Data is The Phenomenon of Our Time

Big Data on AWS. Big Data Agility and Performance Delivered in the Cloud. 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

Presented by Nanditha Thinderu

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

Modernizing Business Intelligence and Analytics

SAP IQ - Business Intelligence and vertical data processing with 8 GB RAM or less

Hadoop Online Training

Big Data Analytics using Apache Hadoop and Spark with Scala

Exadata Implementation Strategy

An Introduction to Big Data Formats

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect

SQL Server Everything built-in

The Technology of the Business Data Lake. Appendix

Sempala. Interactive SPARQL Query Processing on Hadoop

WHITEPAPER. MemSQL Enterprise Feature List

EsgynDB Enterprise 2.0 Platform Reference Architecture

Part 1: Indexes for Big Data

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Big Data on AWS. Peter-Mark Verwoerd Solutions Architect

Modern Data Warehouse The New Approach to Azure BI

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.

I am: Rana Faisal Munir

What is Gluent? The Gluent Data Platform

Introduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent

IBM Data Retrieval Technologies: RDBMS, BLU, IBM Netezza, and Hadoop

Understanding NoSQL Database Implementations

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Data Lake Best Practices

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Ghislain Fourny. Big Data 5. Column stores

The Reality of Qlik and Big Data. Chris Larsen Q3 2016

HANA Performance. Efficient Speed and Scale-out for Real-time BI

Data contains value and knowledge

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Integrated and Hyper-converged Data Protection

Přehled novinek v SQL Server 2016

Transcription:

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack Chief Architect RainStor

Agenda Importance of Hadoop + data compression Data compression techniques Compression, projection and filtering for faster analytics Boost performance by avoiding the Map-Reduce layer

Introducing Hadoop Hadoop is a scalable and reliable platform for storing and processing Big Data Hive Pig Map-Reduce SQL HBase HDFS

Multi-Structured Data Management Tabular Key Value Blob/Clob Relational Event logs XML Documents Binary Objects

Hadoop - The Price of Scalability and Reliability 3x data replication for availability I/O bound operations, and CPU under-utilization Heavyweight Map-Reduce framework Larger cluster sizes More Admin & More Cost

Why Data Compression is Important Reduces cluster size and boosts efficiency Reduces network and disk IO Increases CPU utilization Lowers impact of replication and shuffle Compress data at rest in HDFS Compress data between Map-Reduce tasks Compress output result sets

Compression Techniques General purpose compression libraries: Bzip2, Gzip, LZO, Snappy, LZ4 Applies to all data types Column-oriented compression: Hive columnar file format Proprietary data de-duplication Applies to structured data

Column-Oriented Compressed File Formats Structured data stored in a column-wise order Values in the same column are often related Related values are localized on disk Generic compression algos are more effective Column-level de-duplication can be applied Direct access to columns of interest Projections Reduces IO overhead for analytics workloads

Example: Compression of FS Stock Quote Data 0 5 10 15 20 25 30 35 40 6X 11X 14X 23X 45 50 LZO GZIP BZIP2 Structured data de-duplication Ratios vs uncompressed source data

Structured Data De-duplication 0001 GOOG P 100 0002 Z 250 0003 YHOO 0004 No re-inflation

Hadoop Use Case: Data Warehouse Archiving Data Warehouse [Ingest ½ TB 10TB /Day] Sources: 6,000 DBs Re-instate From Tape 500 TB 20x Compression 10 Years 2PB Applies structured data de-duplication Large bank manages Petabyte data on Hadoop for lower cost analysis and compliance Data warehouse too costly to scale (2 years of data = 500TB) Eliminate Offline tape

Boosting Analytics Performance using Filtering Data often has a natural order: Database transactions ordered by timestamp Or data may have an order imposed on it: Stock quotes partitioned by ticker symbol Order can be used to filter out tasks Only process data in the window of interest

Static Dynamic Filter for Performance Pig statement FILTER quotes BY symbol == GOOG AND exchange == P ; Bloom Filter P P P P P P HDFS P P P P P P

Map-Reduce Framework is Heavyweight Hadoop is best suited to batch analytics Task scheduling takes a few seconds HBase achieves fast performance by avoiding Map-Reduce layer Some SQL DBs can also run on HDFS Implement their own compute frameworks HDFS used as a scalable storage platform

Hadoop and Data Warehouses are Converging Data Warehouse Low Latency Analytics Costly to Scale Built upon SQL Standards Batch Analytics Low Cost Immature BI Support + SQL Hybrid SQL-Hadoop Technologies Bridge the Gap

Minutes Example: FS Stock Quote Data Analysis Batch (all symbols) Ad-hoc (single symbol) 240 4 hrs 240 4 hrs 180 180 120 1hr 20 mins 120 60 60 20 mins 2 mins 8 seconds 0 0 Uncompressed MR All Data Dedup + Project MR All Data Dedup + Project SQL All Data Dedup + Uncompressed Project MR All Data MR w/ Filter Dedup + Project SQL w/filter Query Test: Calculate the average daily quote price for ticker symbols for one day across 1.5 billion quotes

Summary Structured data compression increases storage capacity and cluster efficiency Compression is an IO bandwidth multiplier Filtering and projection can significantly boost the performance of Map-Reduce tasks Only load and process the data required The new breed of SQL on Hadoop databases provide: Faster query response times than standard Hadoop Full access to enterprise-grade Business Intelligence tools

Questions? Mark Cusack mark.cusack@rainstor.com