Introducing Apache Kudu and RecordService (incubating)

Size: px
Start display at page:

Download "Introducing Apache Kudu and RecordService (incubating)"

Transcription

1 Introducing Apache Kudu and RecordService (incubating) Guido Oswald Sales Engineer, Switzerland April 2016, Swiss Big Data User Group Meetup 1

2 Current storage landscape in Hadoop HDFS excels at: Efficiently scanning large amounts of data Accumulating data with high throughput HBase excels at: Efficiently finding and writing individual rows Making data mutable Gaps exist when these properties are needed simultaneously 2

3 Managing the gap (today) Code Complexity Manage flow and sync of data between HDFS and Hbase Monitoring and Security Managing consistent backups, security policies, monitoring and more is hard Performance Significant lag between arrival of Hbase data staging and time when data is available for analytics. 3

4 Changing hardware landscape Spinning disk -> solid state storage NAND flash: Up to 450k read 250k write IOPS, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping 3D XPoint memory (1000x faster than NAND, cheaper than RAM) RAM is cheaper and more abundant: 64->128->256GB over last few years Takeaway 1: The next bottleneck is CPU, and current storage heavy applications weren t designed with CPU efficiency in mind Takeaway 2: Column stores are feasible for random access 4

5 Apache Kudu (Incubating) Storage for Fast Analytics on Fast Data BATCH Spark, Hive, Pig MapReduce PROCESS, ANALYZE, SERVE STREAM Spark RESOURCE MANAGEMENT YARN SQL Impala UNIFIED SERVICES SEARCH Solr SECURITY Sentry, RecordService SDK Kite New updating column store for Hadoop Simplifies the architecture for building analytic applications on changing data Designed for fast analytic performance Natively integrated with Hadoop FILESYSTEM HDFS STRUCTURED Sqoop RELATIONAL Kudu STORE INTEGRATE NoSQL HBase UNSTRUCTURED Kafka, Flume Donated as incubating project at Apache Software Foundation (November 17, 2015) Beta now available 5

6 Kudu design goals High throughput for big scans (columnar storage and replication) Goal: Within 2x of Parquet Low-latency for short accesses (primary key indexes and quorum design) Goal: 1ms read/write on SSD Database-like semantics (initially single-row ACID) Relational data model SQL query NoSQL style scan/insert/update (Java client) 6

7 Kudu basic design Apache-licensed open source software Structured data model Basic construct: tables Tables broken down into tablets (roughly equivalent to partitions) Architecture supports geographically disparate, active/active systems Not the initial design goal 7

8 What Kudu is not Not a SQL interface Just the storage layer BYOSQL Bring-your-own SQL Not a file system Data must have tabular structure Not an application that runs on HDFS An alternative, native Hadoop storage engine Not a replacement for HDFS or HBase Select the right storage for the right use case Cloudera will continue to support and invest in all three 8

9 Kudu data model Tables have a RDBMS-like schema Finite number of columns (unlike HBase/Cassandra) Types: BOOL, INT8/16/32/64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP Some subset of columns make up a primary key Fast random reads/writes by primary key No secondary indexes (yet) Columnar layout on disk - Parquet Lazy materialization Encoding and compression options 9 9

10 Table partitioning Hash bucketing Distribute records by hash of partition column(s) N buckets leads to N tablets Range partitioning Distribute records by ranges of the partition column(s) N split keys leads to N tablets Can be a mix for different columns of the primary key 10

11 Consistency model Consistency and replication enforced by Raft consensus (similar to Paxos) Replication by operation not data Single-row transactions now Multi-row transactions later Geo-distributed replicas will be possible under strict time synchronization Techniques drawn from Google Spanner and others 11

12 Kudu interfaces NoSQL-style APIs Insert(), Update(), Delete(), Scan() Java and C++ now Python soon Integrations with MapReduce, Spark, and Impala No direct access to underlying Kudu tablet files Beta does not have authentication, authorization, encryption 12

13 Impala integration Opens up Kudu to JDBC/ODBC clients Intuitive way to get data into Kudu INSERT INTO kudu_table SELECT * FROM src_table; Additional commands UPDATE DELETE Efficient INSERT VALUES Runs on the Kudu C++ client 13

14 Performance characteristics Very CPU efficient Written in modern C++, uses specialized CPU instructions, JIT compilation with LLVM Latency dependent on storage hardware capabilities Expect sub-millisecond response on SSDs and upcoming technologies No garbage collection allows very large memory footprint with no pauses Bloom filters reduce the need for many disk accesses 14

15 Operating Kudu Easiest through Cloudera Manager integration Separate parcel for now Kudu is always compacting No minor vs. major compaction No compaction latency spikes Web UI is full of metrics and logs 15

16 Cluster layout One or multiple masters Only one in current beta Low CPU and memory impact One tablet server per worker node Can share disks with HDFS One SSD per worker node just for Kudu WAL can speed up writes No dependencies on other Hadoop ecosystem components But interfacing components like Impala or Spark do 16

17 Real-time analytics in Hadoop today Merging in new data = storage complexity Incoming Data (Messaging System) HDFS + Impala Downsides: Multiple storage layers Have we accumulated enough data? Reorganize HBase file into Parquet HBase Parquet File Historic Data Most Recent Partition New Partition Wait for running operations to complete Define new Impala partition referencing the newly written Parquet file Reporting Request Latest data is hidden Files are messy Complex to do updates without breaking running queries 17

18 Real-time analytics in Hadoop with Kudu Incoming Data (Messaging System) Kudu + Impala Improvements: One system to operate Historical and Real-time Data Reporting Request No schedules or background processes Handle late arrivals or data corrections with ease New data available immediately for analytics or operations 18

19 Kudu for data warehousing Near real time data visibility BI tools can display events that happened seconds earlier Excellent for star schemas Fast scans of deep fact tables Efficient wide fact tables Simplified updates of slowly changing dimensions 19

20 Near real time data warehousing on Kudu Simple Files FLUME HUE RDBMS K A F K A K U D U IMPALA User Streams SPARK STREAMING Complex BI tools 20

21 Resources Join the community Download the beta cloudera.com/downloads Read the whitepaper getkudu.io/kudu.pdf 21

22 Creating a Kudu table Table name in Impala does NOT match table name in Kudu. Kudu is its own storage layer. Kudu Storage handler Kudu Master hostname and port A primary key is mandatory 22

23 Spark (Scala) code DataFrame Row Kudu table name Kudu Master hostname and port Create a client, session and table object Extract values from the row, strong types Create an insert object and row Perform the actual insert Cleanup Set the values by type, column name and column valule 23

24 Kudu code examples and docs 0/topics/kudu_development.html 24

25 RecordService 25

26 Permission Enforcement today with Sentry Rule: Allow fraud analysts read access to the transaction table Sentry Enforcement Hive Server 2 Admins specify permissions Sentry Service Sentry Permissions rules Coarse grained (table) Sentry Enforcement Sentry Enforcement Impala HDFS: MR, Pig, Spark,... Apps: Datameer, Platfora, Zoomdata, etc Sentry Enforcement Search (Solr) 26

27 The Need for Fine-Grained Access Control Across all access paths Columns: Sensitive column visibility varies; Example: credit card numbers Managers: Call Centre: XXXX XXXX XXXX 5678 Analysts: XXXX XXXX XXXX XXXX Others: Does not see credit card column Rows: Different groups of users need access to different records European privacy laws Government security clearance Financial information restrictions 27

28 The workaround Split the original file; Use HDFS permissions to limit access Date/time Accnt # National Identifier 09:33: :33: :12: :22: :55: :22: :45: :03: :55: Asset Trade Broker ABC Sell group1 TBT Buy group2 DEF Sell group3 INTC Buy group1 F Buy group1 UA Buy group3 XYZ Sell group2 TMV Buy group1 MA Buy group3 What if only some brokers in each group are allowed to see full IDs? Date/time Accnt # National Identifier 09:33: :22: :55: :03: Date/time Accnt # National Identifier 11:33: :45: Date/time Accnt # National Identifier 14:12: :22: :55: Asset Trade Broker ABC Sell group1 INTC Buy group1 F Buy group1 TMV Buy group1 Asset Trade Broker TBT Buy group2 XYZ Sell group2 Asset Trade Broker DEF Sell group3 UA Buy group3 MA Buy group3 28

29 The Solution Apply controls to the master data file Row, column, and sub-column (masking) controls Ability to enforce these across access paths What All Group 1 Brokers See: Date/time Accnt # National Identifier Asset Trade Broker 09:33: XXX-XX-9876 ABC Sell group1 09:22: :55: XXX-XX-2345 INTC Buy group XXX-XX-8765 F Buy group1 09:03: XXX-XX-5678 TMV Buy group1 29

30 Record Service (Beta) to Enforce Column and Rowlevel Rules Hbase Applications: Datameer, Platfora, etc Hadoop components: MR, Pig, Spark, Solr, Hive Server 2, Impala... RecordService AWS S3 Permissions specified by administrators (top-level and delegated) Rule: Allow managers to see National IDs. HDFS Sentry Service Sentry Permissions rules 30

31 Benefits of RecordService Security Fine-grained data permissions and enforcement across Hadoop Integration with Sentry for policy storage and implementation Interoperability Clients no longer need to be aware of on-disk format Single data access path means single place to implement and test file format related changes Transparently swap components above or below (ex. HDFS -> S3) Performance/Efficiency Performance boosted via Impala s optimized scanner, dynamic code generation, parquet implementation Use projections over original source datasets instead of making so many copies/subsets 31

32 Record Service Architecture 1 Request: - Objects to access - User info Response: - List of splits - Delegation token RecordServicePlanner HDFS NN Sentry Service Hive Metastore Client Client Task Client Task RecordServiceWorker Client Task 2 Job launches as normal 3 Client tasks read records from RecordServiceWorker HDFS DN HBase RS S3 Not yet supported 32

33 Enforcing Sentry Permissions for MR/Spark Create a view in HMS with the necessary column/row restrictions Create a role and assign to a group CREATE VIEW nation_names AS SELECT n_nationkey, n_name FROM tpch.nation; CREATE ROLE demorole; GRANT ROLE demorole to GROUP demogroup; Grant access privilege to that role GRANT SELECT ON TABLE tpch.nation_names TO ROLE demorole; 33

34 Spark Usage Example: RDD Import Record Service package scala> import com.cloudera.recordservice.spark._; import com.cloudera.recordservice.spark._ Read data into a variable using Record Service API scala> val data = sc.recordservicerecords("select * from tpch.nation_names"); data: org.apache.spark.rdd.rdd[array[org.apache.hadoop.io.writable]] = RecordServiceRDD[0] at RDD at RecordServiceRDDBase.scala:57 Perform an action scala> data.count(); res0: Long = 25 34

35 Current Feature Availability Compute: SupportforMR (InputFormat) and Spark (RDDs, SparkSQLDataFrames) Storage: Support for reading HDFS or S3 of file format: Parquet, Text, Sequence File, RC, Avro Data Types: INT (8-64 bits), CHAR/VARCHAR, BOOL, FLOAT, DOUBLE, DECIMAL, STRING, TIMESTAMP No support for LOBs or Nested Types Scalability: Tested up to 80 large/powerful nodes Validated against 1 trillion row (100TB) TeraSort dataset/workload Metadata up to 1M blocks (planning only) Note that TPC-DS run on SparkSQL at 500GB scale point ran 15% faster with Record Service Security: Authentication: Kerberos / LDAP / AD Authorization: Sentry table level privileges, column and row-level privileges using HMS views. Delegation token + task encryption for secure task execution 35

36 Current Limitations Security Limitations Only supports simple single-table views (no joins or aggregations). SSL support has not been tested. Oozie integration has not been tested. UDFs are not supported Storage/File Format Limitations No support for write path. Unable to read from Kudu or HBase. Operation and Administration Limitations No diagnostic bundle support. No metrics available in CM. Application Integration Limitations Spark DataFramenot well tested. See in 36

37 Installation and Platform Support Installation Support CSD installation on CDH5.4+ Parcels, via CM Packages QuickStart VM Client JARs Platform/Hardware Support Server support: RHEL5-7, Ubuntu LTS, SLES, Debian Intel Nehalem (or later) or AMD Bulldozer (or later) processor 64GB memory For optimal performance, run with 12 or more disks or use SSD. Operation and Administration Directly from RecordService: Metrics exposed via a RecordService webapp Profiles for requests via RecordService webapp From CM: Basic service management (start/stop/restart) and basic health checks via CM (process availability). Ability to deploy RecordService Planner, Worker, or Planner+Worker roles. 37

38 Resources RecordService Beta Docs Feature list RecordService Source Code RecordServiceClient libraries 38

39 Thank you 39

Enabling Secure Hadoop Environments

Enabling Secure Hadoop Environments Enabling Secure Hadoop Environments Fred Koopmans Sr. Director of Product Management 1 The future of government is data management What s your strategy? 2 Cloudera s Enterprise Data Hub makes it possible

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

Cloudera Kudu Introduction

Cloudera Kudu Introduction Cloudera Kudu Introduction Zbigniew Baranowski Based on: http://slideshare.net/cloudera/kudu-new-hadoop-storage-for-fast-analytics-onfast-data What is KUDU? New storage engine for structured data (tables)

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

Apache Kudu. Zbigniew Baranowski

Apache Kudu. Zbigniew Baranowski Apache Kudu Zbigniew Baranowski Intro What is KUDU? New storage engine for structured data (tables) does not use HDFS! Columnar store Mutable (insert, update, delete) Written in C++ Apache-licensed open

More information

Time Series Storage with Apache Kudu (incubating)

Time Series Storage with Apache Kudu (incubating) Time Series Storage with Apache Kudu (incubating) Dan Burkert (Committer) dan@cloudera.com @danburkert Tweet about this talk: @getkudu or #kudu 1 Time Series machine metrics event logs sensor telemetry

More information

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours) Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:

More information

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Apache Kudu. A Distributed, Columnar Data Store for Fast Analytics. Mike Percy Software Engineer at Cloudera Apache Kudu PMC member

Apache Kudu. A Distributed, Columnar Data Store for Fast Analytics. Mike Percy Software Engineer at Cloudera Apache Kudu PMC member Apache Kudu A Distributed, Columnar Data Store for Fast Analytics Mike Percy Software Engineer at Cloudera Apache Kudu PMC member 1 Kudu Overview 2 Pace of Data Traditional Hadoop Storage Leaves a Gap

More information

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program

More information

Big Data Architect.

Big Data Architect. Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional

More information

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-12 Copyright 2013 Cloudera

More information

Innovatus Technologies

Innovatus Technologies HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

Configuring and Deploying Hadoop Cluster Deployment Templates

Configuring and Deploying Hadoop Cluster Deployment Templates Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page

More information

Hadoop Development Introduction

Hadoop Development Introduction Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

MapR Enterprise Hadoop

MapR Enterprise Hadoop 2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS

More information

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

Big Data security, tools and tips to protect information assets

Big Data security, tools and tips to protect information assets Big Data security, tools and tips to protect information assets Eddie Garcia Chief Security Architect 1 A Big Data Revolution is Happening as We Speak Industrial Revolution Data Revolution 2 The Benefits

More information

Security and Performance advances with Oracle Big Data SQL

Security and Performance advances with Oracle Big Data SQL Security and Performance advances with Oracle Big Data SQL Jean-Pierre Dijcks Oracle Redwood Shores, CA, USA Key Words SQL, Oracle, Database, Analytics, Object Store, Files, Big Data, Big Data SQL, Hadoop,

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component

More information

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction

More information

Impala Intro. MingLi xunzhang

Impala Intro. MingLi xunzhang Impala Intro MingLi xunzhang Overview MPP SQL Query Engine for Hadoop Environment Designed for great performance BI Connected(ODBC/JDBC, Kerberos, LDAP, ANSI SQL) Hadoop Components HDFS, HBase, Metastore,

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Oracle Big Data Fundamentals Ed 1

Oracle Big Data Fundamentals Ed 1 Oracle University Contact Us: +0097143909050 Oracle Big Data Fundamentals Ed 1 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big Data

More information

Evolution of the Logging Service Hands-on Hadoop Proof of Concept for CALS-2.0

Evolution of the Logging Service Hands-on Hadoop Proof of Concept for CALS-2.0 Evolution of the Logging Service Hands-on Hadoop Proof of Concept for CALS-2.0 Chris Roderick Marcin Sobieszek Piotr Sowinski Nikolay Tsvetkov Jakub Wozniak Courtesy IT-DB Agenda Intro to CALS System Hadoop

More information

Oracle Big Data Fundamentals Ed 2

Oracle Big Data Fundamentals Ed 2 Oracle University Contact Us: 1.800.529.0165 Oracle Big Data Fundamentals Ed 2 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, you learn about big data, the technologies

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

Exam Questions

Exam Questions Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure

More information

Product Compatibility Matrix

Product Compatibility Matrix Compatibility Matrix Important tice (c) 2010-2014, Inc. All rights reserved., the logo, Impala, and any other product or service names or slogans contained in this document are trademarks of and its suppliers

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years

More information

Hadoop. Introduction to BIGDATA and HADOOP

Hadoop. Introduction to BIGDATA and HADOOP Hadoop Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big Data and Hadoop What is the need of going ahead with Hadoop? Scenarios to apt Hadoop Technology in REAL

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

Oracle Big Data Connectors

Oracle Big Data Connectors Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process

More information

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 You have an Azure HDInsight cluster. You need to store data in a file format that

More information

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and

More information

Big Data with Hadoop Ecosystem

Big Data with Hadoop Ecosystem Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process

More information

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big

More information

Cloudera Introduction

Cloudera Introduction Cloudera Introduction Important Notice 2010-2018 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks

More information

Cisco and Cloudera Deliver WorldClass Solutions for Powering the Enterprise Data Hub alerts, etc. Organizations need the right technology and infrastr

Cisco and Cloudera Deliver WorldClass Solutions for Powering the Enterprise Data Hub alerts, etc. Organizations need the right technology and infrastr Solution Overview Cisco UCS Integrated Infrastructure for Big Data and Analytics with Cloudera Enterprise Bring faster performance and scalability for big data analytics. Highlights Proven platform for

More information

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018 Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Hadoop course content

Hadoop course content course content COURSE DETAILS 1. In-detail explanation on the concepts of HDFS & MapReduce frameworks 2. What is 2.X Architecture & How to set up Cluster 3. How to write complex MapReduce Programs 4. In-detail

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Hadoop & Big Data Analytics Complete Practical & Real-time Training

Hadoop & Big Data Analytics Complete Practical & Real-time Training An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE

More information

Hadoop Online Training

Hadoop Online Training Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

Techno Expert Solutions An institute for specialized studies!

Techno Expert Solutions An institute for specialized studies! Course Content of Big Data Hadoop( Intermediate+ Advance) Pre-requistes: knowledge of Core Java/ Oracle: Basic of Unix S.no Topics Date Status Introduction to Big Data & Hadoop Importance of Data& Data

More information

Apache Hive for Oracle DBAs. Luís Marques

Apache Hive for Oracle DBAs. Luís Marques Apache Hive for Oracle DBAs Luís Marques About me Oracle ACE Alumnus Long time open source supporter Founder of Redglue (www.redglue.eu) works for @redgluept as Lead Data Architect @drune After this talk,

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

Turning Relational Database Tables into Spark Data Sources

Turning Relational Database Tables into Spark Data Sources Turning Relational Database Tables into Spark Data Sources Kuassi Mensah Jean de Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 3 Safe Harbor Statement The following

More information

1 Big Data Hadoop. 1. Introduction About this Course About Big Data Course Logistics Introductions

1 Big Data Hadoop. 1. Introduction About this Course About Big Data Course Logistics Introductions Big Data Hadoop Architect Online Training (Big Data Hadoop + Apache Spark & Scala+ MongoDB Developer And Administrator + Apache Cassandra + Impala Training + Apache Kafka + Apache Storm) 1 Big Data Hadoop

More information

Backtesting with Spark

Backtesting with Spark Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel 1 Traditional Grid Shared storage Storage and compute scale independently Bottleneck on I/O

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

Cloudera Introduction

Cloudera Introduction Cloudera Introduction Important Notice 2010-2017 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks

More information

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera, How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS

More information

Cloudera Introduction

Cloudera Introduction Cloudera Introduction Important Notice 2010-2017 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks

More information

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

exam.   Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0 70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to

More information

Microsoft Big Data and Hadoop

Microsoft Big Data and Hadoop Microsoft Big Data and Hadoop Lara Rubbelke @sqlgal Cindy Gross @sqlcindy 2 The world of data is changing The 4Vs of Big Data http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data 3 Common

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Cloudera Manager Quick Start Guide

Cloudera Manager Quick Start Guide Cloudera Manager Guide Important Notice (c) 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this

More information

A Tutorial on Apache Spark

A Tutorial on Apache Spark A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:

More information

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45 Motivation MapReduce is hard to

More information

Apache HBase Andrew Purtell Committer, Apache HBase, Apache Software Foundation Big Data US Research And Development, Intel

Apache HBase Andrew Purtell Committer, Apache HBase, Apache Software Foundation Big Data US Research And Development, Intel Apache HBase 0.98 Andrew Purtell Committer, Apache HBase, Apache Software Foundation Big Data US Research And Development, Intel Who am I? Committer on the Apache HBase project Member of the Big Data Research

More information

Cloudera Improvements in Apache Spark

Cloudera Improvements in Apache Spark Cloudera Improvements in Apache Spark Brian Baillod Sales Engineer 1 Agenda Introduc@on Spark One PlaCorm Ini@a@ve Spark Overview and Improvements Spark Proof of Concept Kudu and Record Service 2 Cloudera

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Important Notice Cloudera, Inc. All rights reserved.

Important Notice Cloudera, Inc. All rights reserved. Apache Kudu Guide Important Notice 2010-2018 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

Enabling Universal Authorization Models using Sentry

Enabling Universal Authorization Models using Sentry Enabling Universal Authorization Models using Sentry Hao Hao - hao.hao@cloudera.com Anne Yu - anneyu@cloudera.com Vancouver BC, Canada, May 9-12 2016 About us Software engineers at Cloudera Apache Sentry

More information

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data

More information

WHITEPAPER. MemSQL Enterprise Feature List

WHITEPAPER. MemSQL Enterprise Feature List WHITEPAPER MemSQL Enterprise Feature List 2017 MemSQL Enterprise Feature List DEPLOYMENT Provision and deploy MemSQL anywhere according to your desired cluster configuration. On-Premises: Maximize infrastructure

More information

Cmprssd Intrduction To

Cmprssd Intrduction To Cmprssd Intrduction To Hadoop, SQL-on-Hadoop, NoSQL Arseny.Chernov@Dell.com Singapore University of Technology & Design 2016-11-09 @arsenyspb Thank You For Inviting! My special kind regards to: Professor

More information

Cloudera Introduction

Cloudera Introduction Cloudera Introduction Important Notice 2010-2018 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks

More information

Elastify Cloud-Native Spark Application with PMEM. Junping Du --- Chief Architect, Tencent Cloud Big Data Department Yue Li --- Cofounder, MemVerge

Elastify Cloud-Native Spark Application with PMEM. Junping Du --- Chief Architect, Tencent Cloud Big Data Department Yue Li --- Cofounder, MemVerge Elastify Cloud-Native Spark Application with PMEM Junping Du --- Chief Architect, Tencent Cloud Big Data Department Yue Li --- Cofounder, MemVerge Table of Contents Sparkling: The Tencent Cloud Data Warehouse

More information

Technical Sheet NITRODB Time-Series Database

Technical Sheet NITRODB Time-Series Database Technical Sheet NITRODB Time-Series Database 10X Performance, 1/10th the Cost INTRODUCTION "#$#!%&''$!! NITRODB is an Apache Spark Based Time Series Database built to store and analyze 100s of terabytes

More information

SOLUTION TRACK Finding the Needle in a Big Data Innovator & Problem Solver Cloudera

SOLUTION TRACK Finding the Needle in a Big Data Innovator & Problem Solver Cloudera SOLUTION TRACK Finding the Needle in a Big Data Haystack @EvaAndreasson, Innovator & Problem Solver Cloudera Agenda Problem (Solving) Apache Solr + Apache Hadoop et al Real-world examples Q&A Problem Solving

More information

CS November 2018

CS November 2018 Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics Cy Erbay Senior Director Striim Executive Summary Striim is Uniquely Qualified to Solve the Challenges of Real-Time

More information

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight ESG Lab Review InterSystems Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous

More information

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED PLATFORM Executive Summary Financial institutions have implemented and continue to implement many disparate applications

More information

Oracle 1Z Oracle Big Data 2017 Implementation Essentials.

Oracle 1Z Oracle Big Data 2017 Implementation Essentials. Oracle 1Z0-449 Oracle Big Data 2017 Implementation Essentials https://killexams.com/pass4sure/exam-detail/1z0-449 QUESTION: 63 Which three pieces of hardware are present on each node of the Big Data Appliance?

More information