Introducing Apache Kudu and RecordService (incubating)
1 Introducing Apache Kudu and RecordService (incubating)
Guido Oswald, Sales Engineer, Switzerland
April 2016, Swiss Big Data User Group Meetup
2 Current storage landscape in Hadoop
- HDFS excels at:
  - Efficiently scanning large amounts of data
  - Accumulating data with high throughput
- HBase excels at:
  - Efficiently finding and writing individual rows
  - Making data mutable
- Gaps exist when these properties are needed simultaneously
3 Managing the gap (today)
- Code complexity
  - Manage the flow and synchronization of data between HDFS and HBase
- Monitoring and security
  - Managing consistent backups, security policies, monitoring and more is hard
- Performance
  - Significant lag between the arrival of data in the HBase staging area and the time when that data is available for analytics
4 Changing hardware landscape
- Spinning disk -> solid state storage
  - NAND flash: up to 450k read and 250k write IOPS, about 2 GB/sec read and 1.5 GB/sec write throughput, at a price of less than $3/GB and dropping
  - 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
- RAM is cheaper and more abundant: 64 -> 128 -> 256 GB over the last few years
- Takeaway 1: The next bottleneck is CPU, and current storage-heavy applications weren't designed with CPU efficiency in mind
- Takeaway 2: Column stores are feasible for random access
5 Apache Kudu (Incubating): Storage for Fast Analytics on Fast Data
- New updating column store for Hadoop
- Simplifies the architecture for building analytic applications on changing data
- Designed for fast analytic performance
- Natively integrated with Hadoop
- Donated as an incubating project to the Apache Software Foundation (November 17, 2015)
- Beta now available
[Diagram: Hadoop stack]
- PROCESS, ANALYZE, SERVE: BATCH (Spark, Hive, Pig, MapReduce), STREAM (Spark), SQL (Impala), SEARCH (Solr), SDK (Kite)
- UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN), SECURITY (Sentry, RecordService)
- STORE: FILESYSTEM (HDFS), RELATIONAL (Kudu), NoSQL (HBase)
- INTEGRATE: STRUCTURED (Sqoop), UNSTRUCTURED (Kafka, Flume)
6 Kudu design goals
- High throughput for big scans (columnar storage and replication)
  - Goal: within 2x of Parquet
- Low latency for short accesses (primary key indexes and quorum design)
  - Goal: 1 ms read/write on SSD
- Database-like semantics (initially single-row ACID)
- Relational data model
  - SQL query
  - NoSQL-style scan/insert/update (Java client)
7 Kudu basic design
- Apache-licensed open source software
- Structured data model
- Basic construct: tables
  - Tables broken down into tablets (roughly equivalent to partitions)
- Architecture supports geographically disparate, active/active systems
  - Not the initial design goal
8 What Kudu is not
- Not a SQL interface
  - Just the storage layer
  - BYOSQL: bring your own SQL
- Not a file system
  - Data must have tabular structure
- Not an application that runs on HDFS
  - An alternative, native Hadoop storage engine
- Not a replacement for HDFS or HBase
  - Select the right storage for the right use case
  - Cloudera will continue to support and invest in all three
9 Kudu data model
- Tables have an RDBMS-like schema
  - Finite number of columns (unlike HBase/Cassandra)
  - Types: BOOL, INT8/16/32/64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
  - Some subset of columns makes up a primary key
  - Fast random reads/writes by primary key
  - No secondary indexes (yet)
- Columnar layout on disk (similar to Parquet)
  - Lazy materialization
  - Encoding and compression options
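To make "columnar layout" and "lazy materialization" concrete, here is a minimal in-memory sketch. It is not Kudu's implementation: the `scan` function, the dictionary-of-lists table, and the sample column names are all assumptions made for illustration. The idea shown is that a scan first reads only the column being filtered, and only then fetches the projected columns for the rows that matched.

```python
from typing import Any, Callable


def scan(
    columns: dict[str, list[Any]],
    filter_column: str,
    predicate: Callable[[Any], bool],
    projection: list[str],
) -> list[tuple]:
    """Scan a column-oriented table (hypothetical helper, not a Kudu API)."""
    # Step 1: touch only the filter column and remember matching row positions.
    matching = [i for i, value in enumerate(columns[filter_column]) if predicate(value)]
    # Step 2: materialize the projected columns, but only at those positions.
    return [tuple(columns[name][i] for name in projection) for i in matching]


# Each column is stored contiguously, one list per column (illustrative data).
table = {
    "host": ["a", "b", "a", "c"],
    "metric": ["cpu", "cpu", "mem", "cpu"],
    "value": [0.5, 0.9, 0.3, 0.7],
}

rows = scan(table, "host", lambda h: h == "a", ["metric", "value"])
print(rows)  # [('cpu', 0.5), ('mem', 0.3)]
```

Columns that are neither filtered nor projected are never read, which is the main reason a columnar layout helps analytic scans.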
10 Table partitioning
- Hash bucketing
  - Distribute records by hash of partition column(s)
  - N buckets lead to N tablets
- Range partitioning
  - Distribute records by ranges of the partition column(s)
  - N split keys lead to N+1 tablets
- Can be a mix for different columns of the primary key
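The two partitioning schemes can be sketched in a few lines. This is an illustration under stated assumptions, not Kudu's code: CRC32 stands in for whatever hash function Kudu uses internally, and the function names and the `(host, timestamp)` key are hypothetical. In this sketch a sorted list of split keys acts as cut points, so two split keys produce three ranges.

```python
import bisect
import zlib


def hash_bucket(value: str, num_buckets: int) -> int:
    """Map a partition-column value to a bucket (CRC32 is a stand-in hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def range_partition(value: int, split_keys: list[int]) -> int:
    """Return the index of the range containing value.

    split_keys must be sorted. A value equal to a split key falls into
    the range that starts at that key.
    """
    return bisect.bisect_right(split_keys, value)


def tablet_for(host: str, timestamp: int, num_buckets: int, split_keys: list[int]) -> tuple[int, int]:
    """Combine both schemes: hash on one key column, range on another."""
    return hash_bucket(host, num_buckets), range_partition(timestamp, split_keys)


splits = [1_000, 2_000]  # two split keys -> three ranges in this sketch
print(range_partition(500, splits), range_partition(1_000, splits), range_partition(2_500, splits))
print(tablet_for("host-1", 1_500, num_buckets=4, split_keys=splits))
```

Hashing spreads writes evenly across tablets, while ranges keep neighbouring keys together for scans; mixing them on different key columns aims to get some of each.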
11 Consistency model
- Consistency and replication enforced by Raft consensus (similar to Paxos)
  - Replication by operation, not data
- Single-row transactions now
  - Multi-row transactions later
- Geo-distributed replicas will be possible under strict time synchronization
  - Techniques drawn from Google Spanner and others
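The core of "replication by operation" with a majority quorum can be sketched as follows. This is a deliberately reduced model, not Raft itself: leader election, terms, and log repair are omitted, and the `Replica` class and `replicate` function are hypothetical names. What it shows is that the operation (not the resulting data) is shipped to each replica's log, and it is applied only once a majority has acknowledged it.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    up: bool = True
    log: list = field(default_factory=list)   # replicated operations
    rows: dict = field(default_factory=dict)  # state built by applying the log

    def append(self, op: tuple) -> bool:
        """Acknowledge the operation if this replica is reachable."""
        if not self.up:
            return False
        self.log.append(op)
        return True

    def apply(self, op: tuple) -> None:
        kind, key, value = op
        if kind == "delete":
            self.rows.pop(key, None)
        else:  # "upsert"
            self.rows[key] = value


def replicate(op: tuple, replicas: list[Replica]) -> bool:
    """Commit op only if a majority of replicas acknowledged it."""
    acked = [r for r in replicas if r.append(op)]
    if len(acked) < len(replicas) // 2 + 1:
        return False  # no quorum: not applied (real Raft would also repair logs)
    for replica in acked:
        replica.apply(op)
    return True


replicas = [Replica(), Replica(), Replica(up=False)]
print(replicate(("upsert", "row-1", "v1"), replicas))  # True: 2 of 3 acknowledged
```

With three replicas, one failure is tolerated; with two replicas down the write is rejected rather than applied on a minority.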
12 Kudu interfaces
- NoSQL-style APIs
  - Insert(), Update(), Delete(), Scan()
  - Java and C++ now
  - Python soon
- Integrations with MapReduce, Spark, and Impala
- No direct access to underlying Kudu tablet files
- Beta does not have authentication, authorization, or encryption
13 Impala integration
- Opens up Kudu to JDBC/ODBC clients
- Intuitive way to get data into Kudu:
  INSERT INTO kudu_table SELECT * FROM src_table;
- Additional commands
  - UPDATE
  - DELETE
  - Efficient INSERT VALUES
- Runs on the Kudu C++ client
14 Performance characteristics
- Very CPU efficient
  - Written in modern C++, uses specialized CPU instructions, JIT compilation with LLVM
- Latency dependent on storage hardware capabilities
  - Expect sub-millisecond response on SSDs and upcoming technologies
- No garbage collection allows a very large memory footprint with no pauses
- Bloom filters reduce the need for many disk accesses
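As an illustration of why Bloom filters save disk accesses, here is a small generic Bloom filter. It is a textbook sketch, not Kudu's implementation: the class name, the SHA-256-based hashing, and the default sizes are assumptions. A negative answer is definite, so a store can skip reading a data block that cannot contain the key; a positive answer only means "maybe", so the block still has to be checked.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter (illustrative; sizes and hashing are assumptions)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key: str):
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
bloom.add("row-42")
if bloom.might_contain("row-42"):
    print("possible hit: read the block from disk")
```

A lookup for a key that was never inserted usually returns False, and in that case the disk read is skipped entirely.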
15 Operating Kudu
- Easiest through Cloudera Manager integration
  - Separate parcel for now
- Kudu is always compacting
  - No minor vs. major compaction
  - No compaction latency spikes
- Web UI is full of metrics and logs
16 Cluster layout
- One or multiple masters
  - Only one in current beta
  - Low CPU and memory impact
- One tablet server per worker node
  - Can share disks with HDFS
  - One SSD per worker node just for the Kudu WAL can speed up writes
- No dependencies on other Hadoop ecosystem components
  - But interfacing components like Impala or Spark do
17 Real-time analytics in Hadoop today
Merging in new data = storage complexity (HDFS + Impala)
- Downsides:
  - Multiple storage layers
  - Latest data is hidden
  - Files are messy
  - Complex to do updates without breaking running queries
[Diagram: Incoming Data (Messaging System); HBase; Parquet File; Historic Data; Most Recent Partition; New Partition; Reporting Request]
- Have we accumulated enough data?
- Reorganize HBase file into Parquet
- Wait for running operations to complete
- Define new Impala partition referencing the newly written Parquet file
18 Real-time analytics in Hadoop with Kudu
Kudu + Impala
- Improvements:
  - One system to operate
  - No schedules or background processes
  - Handle late arrivals or data corrections with ease
  - New data available immediately for analytics or operations
[Diagram: Incoming Data (Messaging System); Historical and Real-time Data; Reporting Request]
19 Kudu for data warehousing
- Near real-time data visibility
  - BI tools can display events that happened seconds earlier
- Excellent for star schemas
  - Fast scans of deep fact tables
  - Efficient wide fact tables
  - Simplified updates of slowly changing dimensions
20 Near real-time data warehousing on Kudu
[Diagram labels: Files, RDBMS, User Streams; Flume, Kafka, Spark Streaming; Kudu; Impala; Hue; BI tools; Simple; Complex]
21 Resources
- Join the community
- Download the beta: cloudera.com/downloads
- Read the whitepaper: getkudu.io/kudu.pdf
22 Creating a Kudu table
[Code listing; annotations:]
- Table name in Impala does NOT match table name in Kudu. Kudu is its own storage layer.
- Kudu storage handler
- Kudu Master hostname and port
- A primary key is mandatory
23 Spark (Scala) code
[Code listing; annotations:]
- DataFrame Row
- Kudu table name
- Kudu Master hostname and port
- Create a client, session and table object
- Extract values from the row, strong types
- Create an insert object and row
- Set the values by type, column name and column value
- Perform the actual insert
- Cleanup
24 Kudu code examples and docs
0/topics/kudu_development.html
25 RecordService
26 Permission Enforcement today with Sentry
- Admins specify permissions in the Sentry Service (Sentry permission rules)
  - Rule: Allow fraud analysts read access to the transaction table
- Coarse grained (table)
- Sentry enforcement in each access path:
  - Hive Server 2
  - Impala
  - Search (Solr)
  - HDFS: MR, Pig, Spark, ...
- Apps: Datameer, Platfora, Zoomdata, etc.
27 The Need for Fine-Grained Access Control
- Across all access paths
- Columns: sensitive column visibility varies
  - Example: credit card numbers
    - Managers:
    - Call Centre: XXXX XXXX XXXX 5678
    - Analysts: XXXX XXXX XXXX XXXX
    - Others: do not see the credit card column
- Rows: different groups of users need access to different records
  - European privacy laws
  - Government security clearance
  - Financial information restrictions
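The column-visibility example can be expressed as a small masking function. This is only a sketch of the policy described on the slide, not RecordService or Sentry code: the role names, `mask_card`, and `project` are hypothetical, and the assumption that managers see the full number is ours, since the managers' value did not survive in the transcript.

```python
from typing import Optional


def mask_card(number: str, role: str) -> Optional[str]:
    """Return the card number as a given role may see it, or None to hide the column."""
    groups = number.split()
    if role == "manager":
        return number  # assumption: managers see the unmasked value
    if role == "call_centre":
        # Mask everything except the last group of digits.
        return " ".join(["XXXX"] * (len(groups) - 1) + [groups[-1]])
    if role == "analyst":
        return " ".join("XXXX" for _ in groups)
    return None  # all other roles do not see the column at all


def project(record: dict, role: str) -> dict:
    """Apply the column rule to one record (hypothetical helper)."""
    visible = {k: v for k, v in record.items() if k != "card_number"}
    masked = mask_card(record["card_number"], role)
    if masked is not None:
        visible["card_number"] = masked
    return visible


record = {"customer": "c-1", "card_number": "1234 1234 1234 5678"}
for role in ("manager", "call_centre", "analyst", "intern"):
    print(role, project(record, role))
```

Keeping the rule in one function, applied on every access path, is the point of the slide: the same record yields a different view per role without copying the data.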
28 The workaround
- Split the original file; use HDFS permissions to limit access
- What if only some brokers in each group are allowed to see full IDs?
Original file (columns: Date/time, Accnt #, National Identifier, Asset, Trade, Broker; values of the first three columns are not recoverable):
  Asset  Trade  Broker
  ABC    Sell   group1
  TBT    Buy    group2
  DEF    Sell   group3
  INTC   Buy    group1
  F      Buy    group1
  UA     Buy    group3
  XYZ    Sell   group2
  TMV    Buy    group1
  MA     Buy    group3
Split files, one per broker group:
  group1: ABC Sell, INTC Buy, F Buy, TMV Buy
  group2: TBT Buy, XYZ Sell
  group3: DEF Sell, UA Buy, MA Buy
29 The Solution
- Apply controls to the master data file
  - Row, column, and sub-column (masking) controls
  - Ability to enforce these across access paths
What all Group 1 brokers see (Date/time and Accnt # values are not recoverable):
  National Identifier  Asset  Trade  Broker
  XXX-XX-9876          ABC    Sell   group1
  XXX-XX-2345          INTC   Buy    group1
  XXX-XX-8765          F      Buy    group1
  XXX-XX-5678          TMV    Buy    group1
30 RecordService (Beta) to Enforce Column- and Row-level Rules
- Permissions specified by administrators (top-level and delegated) in the Sentry Service (Sentry permission rules)
  - Rule: Allow managers to see National IDs.
[Diagram: RecordService sits between clients and storage]
- Applications: Datameer, Platfora, etc.
- Hadoop components: MR, Pig, Spark, Solr, Hive Server 2, Impala, ...
- Storage: HDFS, HBase, AWS S3
31 Benefits of RecordService
- Security
  - Fine-grained data permissions and enforcement across Hadoop
  - Integration with Sentry for policy storage and implementation
- Interoperability
  - Clients no longer need to be aware of the on-disk format
  - Single data access path means a single place to implement and test file-format-related changes
  - Transparently swap components above or below (e.g. HDFS -> S3)
- Performance/Efficiency
  - Performance boosted via Impala's optimized scanner, dynamic code generation, and Parquet implementation
  - Use projections over the original source datasets instead of making so many copies/subsets
32 RecordService Architecture
1. Client sends a request to the RecordServicePlanner
   - Request: objects to access, user info
   - Response: list of splits, delegation token
2. Job launches as normal
3. Client tasks read records from RecordServiceWorker
[Diagram: RecordServicePlanner with HDFS NN, Sentry Service, Hive Metastore; RecordServiceWorker with HDFS DN, S3, and HBase RS (not yet supported)]
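The three steps can be modelled with two small classes. This is a toy model of the flow described on the slide, not the RecordService API: the `Planner`, `Worker`, and `Split` classes, the in-memory permission table, and the token handling are all assumptions made for illustration.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    path: str        # which chunk of data to read
    columns: tuple   # columns the requesting user may see


class Planner:
    """Step 1: check permissions, return splits plus a delegation token."""

    def __init__(self, permissions: dict, files: dict) -> None:
        self.permissions = permissions  # {(user, table): [allowed columns]}
        self.files = files              # {table: [paths]}
        self.issued_tokens: set = set()

    def plan(self, user: str, table: str):
        allowed = self.permissions.get((user, table))
        if allowed is None:
            raise PermissionError(f"{user} may not read {table}")
        token = secrets.token_hex(16)
        self.issued_tokens.add(token)
        return [Split(path, tuple(allowed)) for path in self.files[table]], token


class Worker:
    """Step 3: serve records for a split, returning only permitted columns."""

    def __init__(self, planner: Planner, storage: dict) -> None:
        self.planner = planner
        self.storage = storage  # {path: [records]}

    def read(self, split: Split, token: str) -> list:
        if token not in self.planner.issued_tokens:
            raise PermissionError("unknown delegation token")
        return [{c: rec[c] for c in split.columns} for rec in self.storage[split.path]]


planner = Planner(
    permissions={("alice", "tpch.nation"): ["n_nationkey", "n_name"]},
    files={"tpch.nation": ["part-0"]},
)
worker = Worker(planner, {"part-0": [{"n_nationkey": 0, "n_name": "ALGERIA", "n_comment": "..."}]})

splits, token = planner.plan("alice", "tpch.nation")  # step 1
for split in splits:                                   # steps 2-3: each task reads its split
    print(worker.read(split, token))
```

Because clients only ever receive records from the worker, the restricted column (`n_comment` here) never reaches the client task, whatever framework the task runs in.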
33 Enforcing Sentry Permissions for MR/Spark
- Create a view in HMS with the necessary column/row restrictions
  CREATE VIEW nation_names AS SELECT n_nationkey, n_name FROM tpch.nation;
- Create a role and assign it to a group
  CREATE ROLE demorole;
  GRANT ROLE demorole TO GROUP demogroup;
- Grant access privilege to that role
  GRANT SELECT ON TABLE tpch.nation_names TO ROLE demorole;
34 Spark Usage Example: RDD
- Import the RecordService package
  scala> import com.cloudera.recordservice.spark._;
  import com.cloudera.recordservice.spark._
- Read data into a variable using the RecordService API
  scala> val data = sc.recordServiceRecords("select * from tpch.nation_names");
  data: org.apache.spark.rdd.RDD[Array[org.apache.hadoop.io.Writable]] = RecordServiceRDD[0] at RDD at RecordServiceRDDBase.scala:57
- Perform an action
  scala> data.count();
  res0: Long = 25
35 Current Feature Availability
- Compute: support for MR (InputFormat) and Spark (RDDs, SparkSQL, DataFrames)
- Storage: support for reading from HDFS or S3 in these file formats: Parquet, Text, Sequence File, RC, Avro
- Data types: INT (8-64 bits), CHAR/VARCHAR, BOOL, FLOAT, DOUBLE, DECIMAL, STRING, TIMESTAMP
  - No support for LOBs or nested types
- Scalability:
  - Tested up to 80 large/powerful nodes
  - Validated against a 1 trillion row (100 TB) TeraSort dataset/workload
  - Metadata up to 1M blocks (planning only)
  - Note that a TPC-DS run on SparkSQL at the 500 GB scale point ran 15% faster with RecordService
- Security:
  - Authentication: Kerberos / LDAP / AD
  - Authorization: Sentry table-level privileges; column- and row-level privileges using HMS views
  - Delegation token + task encryption for secure task execution
36 Current Limitations
- Security limitations
  - Only supports simple single-table views (no joins or aggregations).
  - SSL support has not been tested.
  - Oozie integration has not been tested.
  - UDFs are not supported.
- Storage/file format limitations
  - No support for write path.
  - Unable to read from Kudu or HBase.
- Operation and administration limitations
  - No diagnostic bundle support.
  - No metrics available in CM.
- Application integration limitations
  - Spark DataFrame not well tested.
- See in
37 Installation and Platform Support
- Installation support
  - CSD installation on CDH 5.4+
  - Parcels, via CM
  - Packages
  - QuickStart VM
  - Client JARs
- Platform/hardware support
  - Server support: RHEL 5-7, Ubuntu LTS, SLES, Debian
  - Intel Nehalem (or later) or AMD Bulldozer (or later) processor
  - 64 GB memory
  - For optimal performance, run with 12 or more disks or use SSD.
- Operation and administration
  - Directly from RecordService:
    - Metrics exposed via a RecordService webapp
    - Profiles for requests via RecordService webapp
  - From CM:
    - Basic service management (start/stop/restart) and basic health checks via CM (process availability).
    - Ability to deploy RecordService Planner, Worker, or Planner+Worker roles.
38 Resources
- RecordService Beta Docs
- Feature list
- RecordService Source Code
- RecordServiceClient libraries
39 Thank you
Enabling Secure Hadoop Environments
Enabling Secure Hadoop Environments Fred Koopmans Sr. Director of Product Management 1 The future of government is data management What s your strategy? 2 Cloudera s Enterprise Data Hub makes it possible
More informationBig Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours
Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals
More informationCloudera Kudu Introduction
Cloudera Kudu Introduction Zbigniew Baranowski Based on: http://slideshare.net/cloudera/kudu-new-hadoop-storage-for-fast-analytics-onfast-data What is KUDU? New storage engine for structured data (tables)
More informationApril Copyright 2013 Cloudera Inc. All rights reserved.
Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on
More informationApache Kudu. Zbigniew Baranowski
Apache Kudu Zbigniew Baranowski Intro What is KUDU? New storage engine for structured data (tables) does not use HDFS! Columnar store Mutable (insert, update, delete) Written in C++ Apache-licensed open
More informationTime Series Storage with Apache Kudu (incubating)
Time Series Storage with Apache Kudu (incubating) Dan Burkert (Committer) dan@cloudera.com @danburkert Tweet about this talk: @getkudu or #kudu 1 Time Series machine metrics event logs sensor telemetry
More informationMODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS
MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationHadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)
Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:
More informationDelving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture
Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationApache Kudu. A Distributed, Columnar Data Store for Fast Analytics. Mike Percy Software Engineer at Cloudera Apache Kudu PMC member
Apache Kudu A Distributed, Columnar Data Store for Fast Analytics Mike Percy Software Engineer at Cloudera Apache Kudu PMC member 1 Kudu Overview 2 Pace of Data Traditional Hadoop Storage Leaves a Gap
More informationCERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)
CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program
More informationBig Data Architect.
Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional
More informationHadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here
Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-12 Copyright 2013 Cloudera
More informationInnovatus Technologies
HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String
More informationBig Data Analytics using Apache Hadoop and Spark with Scala
Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important
More informationConfiguring and Deploying Hadoop Cluster Deployment Templates
Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationBlended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)
Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance
More informationMapR Enterprise Hadoop
2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS
More informationBig Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture
Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationBig Data. Big Data Analyst. Big Data Engineer. Big Data Architect
Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION
More informationBig Data Hadoop Stack
Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware
More informationBig Data security, tools and tips to protect information assets
Big Data security, tools and tips to protect information assets Eddie Garcia Chief Security Architect 1 A Big Data Revolution is Happening as We Speak Industrial Revolution Data Revolution 2 The Benefits
More informationSecurity and Performance advances with Oracle Big Data SQL
Security and Performance advances with Oracle Big Data SQL Jean-Pierre Dijcks Oracle Redwood Shores, CA, USA Key Words SQL, Oracle, Database, Analytics, Object Store, Files, Big Data, Big Data SQL, Hadoop,
More informationOverview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::
Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional
More informationWe are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info
We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423
More informationImpala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam
Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?
More informationBlended Learning Outline: Cloudera Data Analyst Training (171219a)
Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills
More informationOverview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationThe Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou
The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component
More informationIntroduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data
Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction
More informationImpala Intro. MingLi xunzhang
Impala Intro MingLi xunzhang Overview MPP SQL Query Engine for Hadoop Environment Designed for great performance BI Connected(ODBC/JDBC, Kerberos, LDAP, ANSI SQL) Hadoop Components HDFS, HBase, Metastore,
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationBig Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara
Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case
More informationOracle Big Data Fundamentals Ed 1
Oracle University Contact Us: +0097143909050 Oracle Big Data Fundamentals Ed 1 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big Data
More informationEvolution of the Logging Service Hands-on Hadoop Proof of Concept for CALS-2.0
Evolution of the Logging Service Hands-on Hadoop Proof of Concept for CALS-2.0 Chris Roderick Marcin Sobieszek Piotr Sowinski Nikolay Tsvetkov Jakub Wozniak Courtesy IT-DB Agenda Intro to CALS System Hadoop
More informationOracle Big Data Fundamentals Ed 2
Oracle University Contact Us: 1.800.529.0165 Oracle Big Data Fundamentals Ed 2 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, you learn about big data, the technologies
More informationIntroduction to BigData, Hadoop:-
Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,
More informationExam Questions
Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure
More informationProduct Compatibility Matrix
Compatibility Matrix Important tice (c) 2010-2014, Inc. All rights reserved., the logo, Impala, and any other product or service names or slogans contained in this document are trademarks of and its suppliers
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationIn-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years
More informationHadoop. Introduction to BIGDATA and HADOOP
Hadoop Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big Data and Hadoop What is the need of going ahead with Hadoop? Scenarios to apt Hadoop Technology in REAL
More informationHadoop. Introduction / Overview
Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures
More informationOracle Big Data Connectors
Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process
More informationMicrosoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo
Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 You have an Azure HDInsight cluster. You need to store data in a file format that
More informationSQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism
Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and
More informationBig Data with Hadoop Ecosystem
Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process
More informationHADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)
HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big
More informationCloudera Introduction
Cloudera Introduction Important Notice 2010-2018 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks
More informationCisco and Cloudera Deliver WorldClass Solutions for Powering the Enterprise Data Hub alerts, etc. Organizations need the right technology and infrastr
Solution Overview Cisco UCS Integrated Infrastructure for Big Data and Analytics with Cloudera Enterprise Bring faster performance and scalability for big data analytics. Highlights Proven platform for
More informationBig Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018
Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/
More informationHadoop An Overview. - Socrates CCDH
Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected
More informationHadoop course content
course content COURSE DETAILS 1. In-detail explanation on the concepts of HDFS & MapReduce frameworks 2. What is 2.X Architecture & How to set up Cluster 3. How to write complex MapReduce Programs 4. In-detail
More informationCertified Big Data Hadoop and Spark Scala Course Curriculum