Stinger Initiative. Making Hive 100X Faster. Page 1. Hortonworks Inc. 2013
- Brianne Marylou Morton
- 5 years ago
Transcription
1 Stinger Initiative Making Hive 100X Faster Page 1
2 HDP: Enterprise Hadoop Distribution
Hortonworks Data Platform (HDP): Enterprise Hadoop. The ONLY 100% open source and complete distribution.
- Operational services (manage and operate at scale): Ambari, Oozie
- Data services (store, process and access data): Flume, Sqoop, Pig, Hive, HCatalog, HBase, WebHDFS
- Hadoop core (distributed storage and processing): HDFS, MapReduce, YARN (in 2.0)
- Platform services (enterprise readiness): HA, DR, snapshots, security
- Deployment targets: OS, cloud, VM, appliance
- Enterprise grade, proven and tested at true scale (45,000+ nodes)
- Ecosystem endorsed to ensure interoperability
3 Hive Maturity and Stability
- Hive was originally developed at Facebook.
- More data than existing RDBMS could handle.
- 60,000+ Hive queries per day.
- More than 1,000 users per day.
- 100+ PB of data.
- 15+ TB of data loaded daily.
- Hive is a proven solution at extreme scale.
4 Hive: Vibrant & Existing Ecosystem
- Vendors: Teradata, Microsoft, Microstrategy, Tableau, Karmasphere, Datameer, Information Builders, SAP, Oracle, Actuate, QlikView, SAS, Arcplan, Pentaho, Jaspersoft, Tibco, Talend, Informatica and more
- End users / open source: Facebook, Teradata, SAP, Intel, Twitter, Microsoft, Huawei, Yahoo, Qubole, Citus Data, NexR, InMobi and more
- Hive is the de facto SQL-for-Hadoop solution.
- Hive is proven, robust, scalable.
- Hive supports THE most BI use cases, but Hive is currently optimized for batch processing.
5 So Innovate & Invest in Hive
- BI use cases: Parameterized Reports, Enterprise Reports, Dashboard / Scorecard, Visualization, Data Mining
- Users want: more SQL, better performance
- Workloads range from interactive to batch
6 Stinger: Faster and Improved Insight on Hive
- Goals: Performance Optimizations (100X+ faster time to insight) and Deeper Analytical Capabilities
- Base Optimizations: generate simplified DAGs; in-memory hash joins
- YARN: next-gen Hadoop data processing framework
- Tez: express tasks more simply; eliminate disk writes; pre-warmed containers
- ORCFile: column store; high compression; predicate / filter pushdowns
- Vector Query Engine: optimized for modern processor architectures
- Query Planner: intelligent cost-based optimizer
8 Improved Analytics: ROLLUP, CUBE
    select state, year, sum(amt_paid)
    from sales
    group by state, year with rollup;

    select state, year, sum(amt_paid)
    from sales
    group by state, year with cube;
[Result tables, one per query, with columns State, Year, Sum; * marks a subtotal over that column. The cube result has additional subtotal rows.]
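The difference between the two grouping modes can be sketched outside Hive. The Python snippet below is an illustration only: the sample rows are made up (they are not the slide's figures), and `aggregate`, `rollup` and `cube` are hypothetical helper names. It shows that rollup produces the grouping sets (state, year), (state) and (), while cube produces every subset of the grouping columns.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample rows: (state, year, amt_paid). Not the slide's data.
SALES = [("CA", 2011, 10), ("CA", 2012, 20), ("NY", 2012, 5)]

def aggregate(rows, keep):
    """Sum amt_paid grouped by the columns whose index is in `keep`.
    Columns that are aggregated away are shown as '*'."""
    totals = defaultdict(int)
    for state, year, amt in rows:
        key = (state if 0 in keep else "*", year if 1 in keep else "*")
        totals[key] += amt
    return dict(totals)

def rollup(rows):
    """Grouping sets (state, year), (state), (): a hierarchy of subtotals."""
    out = {}
    for keep in [(0, 1), (0,), ()]:
        out.update(aggregate(rows, keep))
    return out

def cube(rows):
    """Grouping sets for every subset of {state, year}."""
    out = {}
    for n in range(3):
        for keep in combinations((0, 1), n):
            out.update(aggregate(rows, keep))
    return out
```

With these sample rows, `cube` returns everything `rollup` does plus the per-year subtotals such as `("*", 2012)`.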
9 Definitely Room for Improvement.
Simple analytics can turn into unintuitive and inefficient queries:
    select count(*) as rk, s2.state as state, s2.product as product,
           avg(s2.amt_paid), sum(s1.amt_paid)
    from sales s1 join sales s2
      on (s1.product = s2.product and s1.state = s2.state)
    where s1.year <= s2.year
    group by s2.state, s2.product, s2.year
    order by state, product, rk;
10 Definitely Room for Improvement.
This is all we were trying to do! Running total of sales figures!
[Table with columns Number, State, Product, Amount, Total; rows for state CA, products A and B.]
11 Improved Analytics: OVER
MUCH better!
    select rank() over state_and_product, state, product, amt_paid,
           sum(amt_paid) over state_and_product
    from sales
    window state_and_product as (partition by state, product order by year);
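What the windowed query computes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hive's implementation: the sample rows are invented, `running_totals` is a hypothetical name, and the sketch assumes each year appears at most once per (state, product) partition, so ties in rank() and peer rows in the window frame do not arise.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (state, product, year, amt_paid).
SALES = [
    ("CA", "A", 2011, 10),
    ("CA", "A", 2012, 20),
    ("CA", "B", 2012, 7),
    ("CA", "A", 2013, 5),
]

def running_totals(rows):
    """Emulate rank() and sum(amt_paid) over
    (partition by state, product order by year)."""
    # Sort by partition columns first, then by the ordering column.
    ordered = sorted(rows, key=itemgetter(0, 1, 2))
    out = []
    for _, partition in groupby(ordered, key=itemgetter(0, 1)):
        total = 0
        for rank, (state, product, _year, amt) in enumerate(partition, start=1):
            total += amt  # running sum restarts for every partition
            out.append((rank, state, product, amt, total))
    return out
```

Each partition is visited once after a single sort, which is the reason the windowed form avoids the self-join of the earlier query.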
12 Improved Analytics: OVER
- OVER clause: PARTITION BY, ORDER BY, ROWS BETWEEN ... PRECEDING / FOLLOWING
- Works with current aggregate functions
- New aggregates / window functions: RANK, DENSE_RANK, ROW_NUMBER, NTILE, LEAD, LAG, FIRST_VALUE, LAST_VALUE, CUME_DIST, PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC
[Diagram: rows for states AL, CA, CA, CA, CA, NY, annotated with "partition by", "order by" and "rows".]
13 Additional Improved Analytics
- Sub-queries in WHERE
  - Non-correlated only (no values from outer query)
  - [NOT] IN supported
  - Fit in memory as hash table when feasible
- Additional standard SQL data types
  - datetime
  - char() and varchar()
  - add precision and scale to decimal and float
  - aliases for standard SQL types (BLOB = binary, CLOB = string, integer = int, real/number = decimal)
15 Automatic Join Conversion
[Flowchart: the checks "Bucketed?", "Sorted?" and "Small enough?" lead to Sort Merge Bucket Join, Map Join or Shuffle Join.]
- When enabled, Hive will automatically pick the join implementation.
- Query hints no longer needed.
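The decision flow can be sketched as a small function. Everything here is an assumption made for illustration: the branch order is one plausible reading of the flowchart, `TableInfo` and `pick_join` are hypothetical names, and the size threshold is a placeholder rather than a documented Hive default.

```python
from dataclasses import dataclass

@dataclass
class TableInfo:
    size_bytes: int
    bucketed_on_join_key: bool = False
    sorted_on_join_key: bool = False

def pick_join(left, right, small_table_limit=25_000_000):
    """Choose a join strategy from table properties (illustrative only)."""
    both_bucketed = left.bucketed_on_join_key and right.bucketed_on_join_key
    both_sorted = left.sorted_on_join_key and right.sorted_on_join_key
    if both_bucketed and both_sorted:
        # Matching sorted buckets can be merged without a shuffle.
        return "sort-merge-bucket join"
    if min(left.size_bytes, right.size_bytes) <= small_table_limit:
        # The smaller side fits in memory as a hash table.
        return "map join"
    # Fallback: repartition both sides on the join key.
    return "shuffle join"
```

For example, `pick_join(TableInfo(10**12), TableInfo(10**6))` selects the map join because one side is below the placeholder limit.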
16 Dimensionally Structured Data
- Extremely common pattern in EDW
- Large fact tables and small dimension tables
- Dimension tables often fit in RAM
- Oftentimes called Star Schema
17 Star Schema Join
Derived from TPC-DS Query 27:
    SELECT col5, avg(col6)
    FROM fact_table
      join dim1 on (fact_table.col1 = dim1.col1)
      join dim2 on (fact_table.col2 = dim2.col1)
      join dim3 on (fact_table.col3 = dim3.col1)
      join dim4 on (fact_table.col4 = dim4.col1)
    GROUP BY col5
    ORDER BY col5
    LIMIT 100;
Dramatic speedup on Hive 0.11
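The idea behind the speedup, holding small dimension tables in memory as hash tables and scanning the fact table once, can be sketched as follows. This is a minimal illustration, not Hive's map join code: `star_join_avg` is a hypothetical name, and the sketch assumes `col5` comes from a dimension table and `col6` from the fact table, which the query itself does not state.

```python
from collections import defaultdict

def star_join_avg(fact_rows, dims, group_col, value_col):
    """fact_rows: list of dicts.
    dims: {fact foreign-key column: {dimension key: dimension row dict}}.
    Every dimension is an in-memory hash table; the fact table is scanned once."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in fact_rows:
        joined = dict(row)
        for fk, table in dims.items():
            match = table.get(row[fk])  # O(1) hash probe instead of a shuffle
            if match is None:           # inner join: drop unmatched fact rows
                break
            joined.update(match)
        else:
            sums[joined[group_col]] += joined[value_col]
            counts[joined[group_col]] += 1
    # GROUP BY ... ORDER BY on the grouping column
    return {k: sums[k] / counts[k] for k in sorted(sums)}

# Hypothetical data: two dimensions, three fact rows (one without a match).
FACT = [
    {"col1": 1, "col2": 10, "col6": 4.0},
    {"col1": 2, "col2": 10, "col6": 6.0},
    {"col1": 1, "col2": 99, "col6": 100.0},
]
DIMS = {
    "col1": {1: {"col5": "x"}, 2: {"col5": "x"}},
    "col2": {10: {"d2_attr": "ok"}},
}
```

Because no row leaves the scanning task until it has been joined against every dimension, the whole multi-way join fits in a single pass over the fact table.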
18 Star Schema Joins BEFORE Page 18
19 Star Schema Joins AFTER
20 Star Schema Join Performance 35X Improvement (More to Come) Page 20
21 Large Tables Join BEFORE
[Figure: per-stage execution timeline (Stage-2, Stage-3); axes: Stage, Start Time.]
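22 Hive Sort Merge Bucket Join
Bucketing allows Hive to physically co-locate rows within files (sorted or unsorted):
    -- country and continent are added to the column list here because
    -- CLUSTERED BY / SORTED BY must name declared columns (assumed STRING);
    -- n is a placeholder for the bucket count.
    CREATE EXTERNAL TABLE IF NOT EXISTS test_table (
      id INT,
      name STRING,
      country STRING,
      continent STRING
    )
    PARTITIONED BY (dt STRING, hour STRING)
    CLUSTERED BY (country, continent)
    SORTED BY (country, continent) INTO n BUCKETS
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    LOCATION '/home/test_dir';
Do the two large tables being joined share the same sorted bucket key? Then the join is very efficient (shuffles are minimized).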
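Why a shared sorted bucket key helps can be sketched in Python. This is an illustration only, with hypothetical names (`bucketize`, `merge_join`, `smb_join`): both tables are bucketed by the same hash of the key and sorted inside each bucket, so bucket i of one table only ever needs to be merged with bucket i of the other. In a real deployment the bucketing and sorting happen when the data is written; the sketch performs them inline to stay self-contained.

```python
def bucketize(rows, n_buckets):
    """Write-time step: place (key, value) rows into buckets by key hash,
    each bucket sorted by key."""
    buckets = [[] for _ in range(n_buckets)]
    for key, value in rows:
        buckets[hash(key) % n_buckets].append((key, value))
    return [sorted(b, key=lambda r: r[0]) for b in buckets]

def merge_join(left, right):
    """Join two key-sorted lists of (key, value) pairs in one forward pass."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Locate the run of equal keys on the right, then pair it with
            # every left row carrying the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out

def smb_join(left_buckets, right_buckets):
    """Merge matching buckets pairwise; no cross-bucket shuffle is needed."""
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        out.extend(merge_join(lb, rb))
    return out
```

Both inputs must be built with the same bucket count, for example `smb_join(bucketize(a, 4), bucketize(b, 4))`.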
23 Large Tables Join AFTER
[Figure: per-stage execution timeline; axis: Stage.]
24 Sort-Merge-Bucket Join Performance 45X Improvement (More to Come) Page 24
25 Summary: Early Benchmarking Results
TPC-DS sample queries:
- Query One (left): star schema join, 35X improvement
- Query Two (right): join of two tables too large to fit in memory, 45X speedup
MUCH more to come as we make our way to 100X and beyond!
27 ORCFile: Even Faster
- Query-optimized: splittable, columnar storage file
- Efficient reads: data is broken into large stripes for efficient reads
- Fast filtering: built-in index, min/max and metadata for fast filtering of blocks; bloom filters if desired
- Efficient compression: complex row types are decomposed into primitives and run-length encoded, giving massive compression and efficient comparisons for filtering
- Pre-computation: built-in aggregates per block (min, max, count, sum)
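Two of these ideas, run-length encoding a column and skipping blocks through their min/max statistics, can be sketched in Python. The snippet is an illustration and does not reflect the actual ORC file layout; `rle_encode`, `build_stripes` and `scan_greater_than` are hypothetical names.

```python
def rle_encode(values):
    """Run-length encode a column as [(value, run_length), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

def build_stripes(column, stripe_size):
    """Split a numeric column into stripes and keep per-stripe statistics."""
    stripes = []
    for start in range(0, len(column), stripe_size):
        chunk = column[start:start + stripe_size]
        stripes.append({"min": min(chunk), "max": max(chunk),
                        "count": len(chunk), "sum": sum(chunk),
                        "data": chunk})
    return stripes

def scan_greater_than(stripes, threshold):
    """Evaluate `value > threshold`, skipping stripes whose max rules them out.
    Returns the matching values and the number of stripes actually read."""
    hits, stripes_read = [], 0
    for s in stripes:
        if s["max"] <= threshold:
            continue  # predicate pushdown: no row in this stripe can match
        stripes_read += 1
        hits.extend(v for v in s["data"] if v > threshold)
    return hits, stripes_read
```

The same per-stripe statistics also answer some aggregates without touching the data: a total row count, for instance, is just the sum of the stored `count` fields.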
28 ORCFile High Compression Data set from TPC-DS Page 28
30 Longer Term Hive Performance
- Vectorized execution: rewrite all operations to operate on blocks of 1K+ records, rather than one record at a time
  - A block is an array of Java scalars, not Objects (eliminating Objects reduces GC work, and the gains compound over time)
  - Avoids many function calls and CPU pipeline stalls
  - Blocks are sized to fit in L1 cache, avoiding cache misses
- Cost-based optimizer: generate better DAGs based on properties of the data being queried (table size, statistics, histograms, etc.)
- Buffer/cache data hotspots
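The shape of block-at-a-time processing can be sketched in Python. This shows only the control flow: the planned engine is written in Java and gets its speed from primitive arrays and tight loops, which Python does not reproduce. `batches` and `filter_sum_vectorized` are hypothetical names, and the block size of 1024 is taken from the "1K+" figure above.

```python
from array import array

BATCH_SIZE = 1024  # "1K+" records per block

def batches(column, size=BATCH_SIZE):
    """Yield consecutive fixed-size blocks of a primitive column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]

def filter_sum_vectorized(amounts, min_amount):
    """Sum all amounts >= min_amount, one block at a time."""
    total = 0.0
    for block in batches(amounts):
        # The operator is invoked once per block, not once per record,
        # and the block holds unboxed doubles rather than row objects.
        total += sum(x for x in block if x >= min_amount)
    return total

# Hypothetical column of 5000 double values stored as primitives.
AMOUNTS = array("d", range(5000))
```

With 5000 values the operator body runs five times instead of 5000, which is where the saved function calls come from.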
32 Tez: High Throughput and Low Latency
- Tez generalizes Map-Reduce: simplified execution plans process data more efficiently
- Always-on Tez service: low latency processing for all Hadoop data processing
33 Tez: High Throughput and Low Latency
Tez runs in YARN.
[Diagram: clients submit jobs to the Resource Manager; Node Managers host App Masters and Containers; arrows labeled MapReduce Status, Job Submission, Node Status, Resource Request.]
Accelerate high throughput AND low latency processing.
34 Tez: Core Idea
- Task with pluggable Input, Processor & Output
- YARN ApplicationMaster runs DAG of Tez Tasks
[Diagram: Input, Processor and Output boxes composed into a Task.]
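The core idea can be sketched as a toy DAG runner in Python. This is not the Tez API: `Task`, `run_dag` and the processors below are hypothetical, a task's input is simply the output of its upstream tasks, and its output is kept in an in-memory dictionary instead of being written to HDFS.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    name: str
    processor: Callable[[list], list]  # pluggable processing step
    upstream: List[str]                # tasks (or sources) feeding this task

def run_dag(tasks, sources):
    """Run tasks in dependency order; `sources` maps names to input rows."""
    results = dict(sources)
    pending = {t.name: t for t in tasks}
    while pending:
        ready = [t for t in pending.values()
                 if all(u in results for u in t.upstream)]
        if not ready:
            raise ValueError("cycle or missing input in DAG")
        for t in ready:
            rows = [r for u in t.upstream for r in results[u]]
            results[t.name] = t.processor(rows)  # handed on in memory
            del pending[t.name]
    return results

def keep_positive(rows):
    return [r for r in rows if r[1] > 0]

def count_by_state(rows):
    counts = {}
    for state, _ in rows:
        counts[state] = counts.get(state, 0) + 1
    return sorted(counts.items())

# A two-stage DAG: filter, then aggregate, as one job.
DAG = [
    Task("count", count_by_state, ["scan"]),
    Task("scan", keep_positive, ["a"]),
]
```

Calling `run_dag(DAG, {"a": rows})` executes `scan` before `count` regardless of the order in which the tasks were listed.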
35 Tez: Hive Performance
- Low-level data-processing execution engine on YARN
- Base for MapReduce, Hive, Pig, Cascading, etc.
- Re-usable data processing primitives (e.g. sort, merge, intermediate data management)
- Hive SQL can be expressed as a single job
  - Jobs are no longer interrupted (efficient pipeline)
  - Avoid writing intermediate output to HDFS when the performance gain outweighs the cost of a job restart (speed and network/disk usage savings)
  - Break the MR contract to turn MRMRMR into MRRR (flexible DAG)
- Removes task and job overhead (10-30s savings is huge for a 2s query!)
36 Pig/Hive optimized on Tez
    SELECT a.state, COUNT(*)
    FROM a JOIN b ON (a.id = b.id)
    GROUP BY a.state
[Diagram: the MapReduce plan has an I/O synchronization barrier; the Tez plan uses I/O pipelining.]
37 Pig/Hive optimized on Tez
    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
      JOIN b ON (a.id = b.id)
      JOIN c ON (a.itemid = c.itemid)
    GROUP BY a.state
[Diagram: the MapReduce plan runs Job 1, Job 2 and Job 3 with an I/O synchronization barrier between jobs; the Tez plan runs as a single job.]
38 Innovation via Community
- Performance Optimizations: 100X+ faster time to insight
- Deeper Analytical Capabilities
More informationTechno Expert Solutions An institute for specialized studies!
Course Content of Big Data Hadoop( Intermediate+ Advance) Pre-requistes: knowledge of Core Java/ Oracle: Basic of Unix S.no Topics Date Status Introduction to Big Data & Hadoop Importance of Data& Data
More informationMicrosoft Querying Microsoft SQL Server 2014
1800 ULEARN (853 276) www.ddls.com.au Microsoft 20461 - Querying Microsoft SQL Server 2014 Length 5 days Price $4290.00 (inc GST) Version D Overview Please note: Microsoft have released a new course which
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationTrafodion Enterprise-Class Transactional SQL-on-HBase
Trafodion Enterprise-Class Transactional SQL-on-HBase Trafodion Introduction (Welsh for transactions) Joint HP Labs & HP-IT project for transactional SQL database capabilities on Hadoop Leveraging 20+
More informationCAST(HASHBYTES('SHA2_256',(dbo.MULTI_HASH_FNC( tblname', schemaname'))) AS VARBINARY(32));
>Near Real Time Processing >Raphael Klebanov, Customer Experience at WhereScape USA >Definitions 1. Real-time Business Intelligence is the process of delivering business intelligence (BI) or information
More informationMicrosoft Big Data and Hadoop
Microsoft Big Data and Hadoop Lara Rubbelke @sqlgal Cindy Gross @sqlcindy 2 The world of data is changing The 4Vs of Big Data http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data 3 Common
More informationParallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce
Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The
More informationAn Introduction to Big Data Formats
Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION
More informationCertified Big Data and Hadoop Course Curriculum
Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation
More informationThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,
More informationQuerying Microsoft SQL Server 2014
Querying Microsoft SQL Server 2014 Código del curso: 20461 Duración: 5 días Acerca de este curso This 5 day instructor led course provides students with the technical skills required to write basic Transact
More informationAnswer: A Reference:http://www.vertica.com/wpcontent/uploads/2012/05/MicroStrategy_Vertica_12.p df(page 1, first para)
1 HP - HP2-N44 Selling HP Vertical Big Data Solutions QUESTION: 1 When is Vertica a better choice than SAP HANA? A. The customer wants a closed ecosystem for BI and analytics, and is unconcerned with support
More informationAster Data Basics Class Outline
Aster Data Basics Class Outline CoffingDW education has been customized for every customer for the past 20 years. Our classes can be taught either on site or remotely via the internet. Education Contact:
More information20461D: Querying Microsoft SQL Server
20461D: Querying Microsoft SQL Server Course Details Course Code: Duration: Notes: 20461D 5 days This course syllabus should be used to determine whether the course is appropriate for the students, based
More informationHadoop Map Reduce 10/17/2018 1
Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018
More informationBlended Learning Outline: Cloudera Data Analyst Training (171219a)
Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills
More informationBig Data Architect.
Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional
More information