Cloudera Improvements in Apache Spark

Size: px
Start display at page:

Download "Cloudera Improvements in Apache Spark"

Transcription

1 Cloudera Improvements in Apache Spark Brian Baillod Sales Engineer 1

2 Agenda Spark One PlaCorm Spark Overview and Improvements Spark Proof of Concept Kudu and Record Service 2

3 Cloudera company snapshot Founded 2008, by former employees of Employees Today 900+ worldwide World Class Support More than 75 24x7 global staff Cloudera University Over 40,000 trained We help code Hadoop Cloudera employees are leading developers & contributors to the complete Apache Hadoop ecosystem of projects We help fix Hadoop Cloudera fixed 60% of all Hadoop JIRA bugs 3

4 Hadoop Categories of Hadoop adop/on Free/Developer Over 2.5 million downloads Subscrip/on Training 60% of Fortune 100 a`ended Cloudera training, over 40,000 trained since 2009 Business Need Training Free/ Developer Services & Support Big Data Maturity Service & Support 9/10 for support ability to solve technical issues #1 recommenda/on Over 2x revenue of nearest 90% renewal rate 4

5 What is Spark Fast general purpose processing engine for large data Provides API s in Java, Scala and Python Includes an advanced DAG execu@on engine that supports in- memory compu@ng Includes high level tools like SparkSQL, Mllib, GraphX, and Spark Streaming Can run in a cluster, standalone, or local Latest version is Spark.apache.org LOTS of momentum 5

6 Cloudera One PlaCorm Cloudera is doubling down on Spark Outlining a vision for the future Kudu, Record Service, Auto- tuning, Security, Kaia integra@on Challenging other vendors to par@cipate in Spark Development 6

7 Cloudera s Engineering Commitment to Spark Spark CommiPers by Hadoop Distribu/on* Spark Patches by Hadoop Distribu/on Hortonworks 17% Intel 17% Cloudera 67% Cloudera, 370 Hortonworks, 4 IBM, 12 MapR, 1 Intel, 400 * IBM and MapR have 0 commipers 7

8 Spark will replace MapReduce To become the standard engine for Hadoop 8

9 The Future of Data Processing on Hadoop Spark complemented by specialized fit- for- purpose engines General Data Processing w/ Spark Fast Batch Processing, Machine Learning, and Stream Processing Full- Text Search w/solr Querying textual data Analy/c Database w/ Impala Low- Latency Massively Concurrent Queries On- Disk Processing w/mapreduce Jobs at extreme scale and extremely disk IO intensive Shared: Data Storage Metadata Resource Management Security Governance 9

10 Why is Cloudera leading this Cloudera was the first Hadoop vendor to ship and support Spark Spark is a fully integrated part of Cloudera s placorm Shared data, metadata, resource management, administra@on, security, and governance Cloudera is the first Hadoop vendor to offer Spark training Trained more customers than any other vendor Cloudera has more Spark customers in produc@on than all other companies combined 10

11 Spark Overview and Improvements 11

12 Apache Spark Flexible, in- memory data processing for Hadoop Easy Development Rich APIs for Scala, Java, and Python shell Flexible Extensible API APIs for different types of workloads: Batch Streaming Machine Learning Graph Fast Batch & Stream Processing In- Memory processing and caching 12

13 Easy Development High Language Support Python lines = sc.textfile(...) lines.filter(lambda s: ERROR in s).count() Scala val lines = sc.textfile(...) lines.filter(s => s.contains( ERROR )).count() Java JavaRDD<String> lines = sc.textfile(...); lines.filter(new Function<String, Boolean>() { Boolean call(string s) { return s.contains( error ); } }).count(); Na@ve support for mul@ple languages with iden@cal APIs Scala, Java, Python Use of closures, itera@ons, and other common language constructs to minimize code 2-5x less code 13

14 Python Or Scala? Use Python for prototyping Spark Python API is slower than Scala Use Scala for development Steep learning curve for programming 14

15 Easy Development Use percolateur:spark srowen$./bin/spark-shell --master local[*]... Welcome to / / / / \ \/ _ \/ _ `/ / '_/ / /. /\_,_/_/ /_/\_\ version SNAPSHOT /_/ Using Scala version (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51) Type in expressions to have them evaluated. Type :help for more information.... scala> val words = sc.textfile("file:/usr/share/dict/words")... words: org.apache.spark.rdd.rdd[string] = MapPartitionsRDD[1] at textfile at <console>:21 Interac@ve explora@on of data for data scien@sts No need to develop applica@ons Developers can prototype applica@on on live system scala> words.count... res0: Long = scala> 15

16 Easy Development Expressive API map filter groupby sort union join leftouterjoin rightouterjoin reduce count fold reducebykey groupbykey cogroup cross zip sample take first partitionby mapwith pipe save 16

17 Memory Management for Greater Performance Memory can be enabler for high performance big data applica/ons GB RAM 50 GB per second Trends: ½ price every 18 months 2x bandwidth every 3 years 16 cores 17

18 Spark Concepts RDD Resilient Distributed Dataset Caching DataFrames Spark Streaming SparkSQL Pluggable Spark 18

19 Resilient Distributed Dataset (RDD) Read- only of records Created through: of data in storage of RDDs Contains lineage to compute from storage Lazy Users control persistence and 19

20 RDD Transforma/ons create new RDD from an one Ac/ons run on RDD and return a value Transforma@ons are lazy Ac@ons materialize RDDs by compu@ng transforma@ons RDDs can be cached to avoid re- compu@ng 20

21 Example Transforma/ons Map Filter Sample Join Ac/ons Reduce Count First, Take SaveAs 21

22 Fault- Tolerance RDDs contain lineage Lineage: Source and list of Lost can be re- computed from source data msgs = textfile.filter(lambda s: s.startswith( ERROR )).map(lambda s: s.split( \t )[2]) HDFS File Filtered RDD Mapped RDD filter (func = startswith( )) map (func = split(...)) 22

23 Caching Storage Levels Different provide tradeoffs between memory usage and CPU efficiency. Cache when using algorithms. MEMORY_ONLY most CPU efficient, data has to fit in memory MEMORY_ONLY_SER More space efficient but reasonably fast MEMORY_AND_DISK MEMORY_AND_DISK_SER DISK_ONLY MEMORY_ONLY_2, MEMORY_AND_DISK_2 23

24 Data Frames Distributed of rows organized into named columns Spark SQL s Data Source API can read and write Data Frames using a variety of formats Hive, JSON, Parquet, HDFS Calling the DataFrame API can let you Select the columns you want Join data sources Aggregate and Filter Spark 1.5 lets you access the Hive Metastore to read/write schemas directly. 24

25 Spark Streaming What is it? Run con,nuous processing of data using Spark s core API Extends Spark concepts to fault- tolerant, transformable streams Adds rolling window opera@ons Example: Compute rolling averages or counts for data over last five minutes Benefits: Same programming paradigm for streaming and batch Excellent throughput Scale easily to support large volumes of data ingest Common Use Cases: On- the- fly ETL as data is ingested into Hadoop/HDFS Detect anomalous behavior and trigger alerts Con@nuous repor@ng of summary metrics for incoming data 25

26 Spark Streaming Architectures Spark Stream Processing Data Sources Integra/on Layer Data Prep / Scoring Ingest Flume Kaia Transformed Results HBase Real- Time Result Serving HDFS Spark Long- Term Analy/cs/ Model Building 26

27 SparkSQL Machine Learning Goal: Spark/Java Developers and Data can inline SQL into Spark apps Designed for: Ease of development for Spark developers Handful of concurrent Spark jobs Strengths: Ease of embedding SQL into Java or Scala SQL for common in developer flow (eg. filters, samples) 27

28 Impala Remains Tool of Choice for SQL Time (in seconds) Single User vs 10 User Response Time/Impala Times Faster (Lower bars = be`er) Single User, 5 10 Users, 11 Single User, x 10 Users, x Impala Spark SQL Presto Hive- on- Tez Single User, x 10 Users, x Single User, x 10 Users, x 28

29 Pluggable Spark replace MapReduce Cloudera is leading community development to port components to Spark: Stage 1 Stage 2 Stage 3 Crunch on Spark Search on Spark Hive on Spark (beta) Spark on HBase (beta) Pig on Spark (alpha) Sqoop on Spark Spark on Kudu 29

30 Spark Customer Use Cases Core Spark Financial Services Health PorColio Risk Analysis ETL Pipeline Speed- Up 20+ years of stock data disease- causing genes in the full human genome Calculate Jaccard scores on health care data sets Spark Streaming Financial Services Health Online Fraud Incident for Sepsis ERP Character and Bill Retail Online Systems Real- Time Inventory Management 1010 Data Services Trend analysis Document (LDA) Fraud Ad Tech Real- Time Ad Performance Analysis 30

31 Doing the Math Executors and Cores Determine the resource for the Spark job 4 Core 4 Core C 1 Core for Applica@on Master Cores for Executors 1 Executor 4 Cores x 15 Cores 3 Executors with 4 Cores Each (Leaves 3 Cores un- u@lized) Other Ra@os may lead to be`er resource u@liza@on 4 Core 1 Executor 2 Cores x 15 Cores 7 Executors with 2 Cores Each (Leaves 1 Core un- u@lized) 4 Core Don t exceed= 5 Cores per Executor h`p://blog.cloudera.com/blog/2015/03/how- to- tune- your- apache- spark- jobs- part- 2/ 16 Total Cores in Cluster Core Alloca@on Allocate Executors 31

32 Kudu 32

33 Kudu Design Goals High throughput for big scans (columnar storage and Goal: Within 2x of Parquet Low- latency for short accesses (primary key indexes and quorum design) Goal: 1ms read/write on SSD Database- like single- row ACID) Rela/onal data model SQL query NoSQL style scan/insert/update (Java client) 33

34 Kudu Storage for Fast on Fast Data DATA ENGINEERING DATA DISCOVERY & ANALYTICS DATA APPS BATCH STREAM SQL SEARCH MODEL ONLINE SPARK, HIVE, PIG SPARK IMPALA SOLR SPARK HBASE UNIFIED DATA SERVICES RESOURCE MANAGEMENT YARN SECURITY SENTRY New column store for Hadoop Simplifies the architecture for building on changing data Designed for fast performance integrated with Hadoop FILESYSTEM HDFS DATA INTEGRATION & STORAGE RELATIONAL KUDU NoSQL HBASE Apache- licensed open source (intent to donate to ASF) INGEST SQOOP, FLUME, KAFKA Beta now available 34

35 Kudu Trade- Offs Random updates will be slower HBase model allows random updates without incurring a disk seek Kudu requires a key lookup before update, Bloom lookup before insert Single- row reads may be slower Columnar design is op@mized for scans Future: may introduce column groups for applica@ons where single- row access is more important 35

36 Resources Join the community h`p://getkudu.io Download the Beta cloudera.com/downloads Read the Whitepaper getkudu.io/kudu.pdf 36

37 RecordService 37

38 Hadoop started out with zero security Didn t need it for the Silicon Valley applica@ons Does need it for Corporate applica@ons Cloudera is working on providing full featured Spark Security 38

39 Comprehensive, Compliance- Ready Security Audit, and Compliance Perimeter Access Visibility Data Guarding access to the cluster itself Defining what users and can do with data on where data came from and how it s being used Protec@ng data in the cluster from unauthorized visibility Technical Concepts: Authen@ca@on Network isola@on Technical Concepts: Permissions Authoriza@on Technical Concepts: Audi@ng Lineage Technical Concepts: Encryp@on, Tokeniza@on, Data masking Cloudera Manager Apache Sentry & RecordService Cloudera Navigator Navigator Encrypt & Key Trustee Partners 39

40 Directory and Kerberos Directory Manages Users, Groups, and Services Provides username / password authen@ca@on Group membership determines Service access Kerberos Trusted and standard third- party Authen@cated users receive Tickets Tickets gain access to Services User [ssmith] Password[***** ] User authen@cates to AD Authen@cated user gets Kerberos Ticket Ticket grants access to Services e.g. Impala 40

41 Fine- Grained Access Control in HDFS Across All Hadoop Paths Columns: column visibility varies by role (Ex. credit card numbers) Managers: Call Center: XXXX XXXX XXXX 5678 Analysts: XXXX XXXX XXXX XXXX Others: No access to credit card column Rows: Different user groups need access to different records European privacy laws Government security clearance Financial 41

42 RecordService Unified Access Control Enforcement DATA ENGINEERING DATA DISCOVERY & ANALYTICS DATA APPS BATCH STREAM SQL SEARCH MODEL ONLINE SPARK, HIVE, PIG SPARK IMPALA SOLR SPARK HBASE UNIFIED DATA SERVICES RESOURCE MANAGEMENT YARN SECURITY SENTRY, RECORDSERVICE New high performance security layer that centrally enforces access control policies across Hadoop Complements Apache Sentry s unified policy defini@on Row- and column- based security Dynamic data masking DATA INTEGRATION & STORAGE FILESYSTEM HDFS INGEST SQOOP, FLUME, KAFKA NoSQL HBASE Apache- licensed open source Beta now available 42

43 Fine- Grained HDFS Access without RecordService Split the original file Use HDFS permissions to limit access Date//me Accnt # SSN Asset Trade Country 09:33: :33: :12: :22: :55: :22: :45: :03: :55: AAPL Sell US TBT Buy EU IBM Sell UK INTC Buy US F Buy US UA Buy UK AMZN Sell EU TMV Buy US MA Buy UK Date//me Accnt # SSN Asset Trade Country 09:33: :22: :55: :03: Date//me Accnt # SSN Asset Trade Country 11:33: :45: Date//me Accnt # SSN Asset Trade Country 14:12: :22: :55: AAPL Sell US INTC Buy US F Buy US TMV Buy US TBT Buy EU AMZN Sell EU IBM Sell UK UA Buy UK MA Buy UK 43

44 Fine- Grained HDFS Access Control with RecordService Apply controls to the master data file Row, column, and sub- column (masking) controls Enforce these across all access paths Column- Level Controls What U.S. Brokers See Column- Level Controls Date//me Accnt # SSN Asset Trade Country Date//me Accnt # SSN Asset Trade Country 09:33: AAPL Sell US 09:33: XXX- XX AAPL Sell US 11:33: :12: :22: :55: TBT Buy EU IBM Sell EU INTC Buy US F Buy US Row- Level Controls 11:33: :12: :22: :55: TBT Buy group IBM Sell group XXX- XX INTC Buy US XXX- XX F Buy US Row- Level Controls 10:22: UA Buy EU 10:22: UA Buy group3 13:45: AMZN Sell EU 13:45: AMZN Sell group2 44

45 Spark Resources Learn Spark Spark Cookbook by Rishi Yadav O Reilly Advanced Analy@cs with Spark ebook (wri`en by Clouderans) Cloudera Developer Blog cloudera.com/spark Get Trained Cloudera Spark Training 45

Enabling Secure Hadoop Environments

Enabling Secure Hadoop Environments Enabling Secure Hadoop Environments Fred Koopmans Sr. Director of Product Management 1 The future of government is data management What s your strategy? 2 Cloudera s Enterprise Data Hub makes it possible

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information

Introduction to Apache Spark. Patrick Wendell - Databricks

Introduction to Apache Spark. Patrick Wendell - Databricks Introduction to Apache Spark Patrick Wendell - Databricks What is Spark? Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop Efficient General execution graphs In-memory storage

More information

Spark Overview. Professor Sasu Tarkoma.

Spark Overview. Professor Sasu Tarkoma. Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based

More information

Introducing Apache Kudu and RecordService (incubating)

Introducing Apache Kudu and RecordService (incubating) Introducing Apache Kudu and RecordService (incubating) Guido Oswald Sales Engineer, Switzerland April 2016, Swiss Big Data User Group Meetup 18 @GuidoOswald 1 Current storage landscape in Hadoop HDFS excels

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark Anastasios Skarlatidis @anskarl Software Engineer/Researcher IIT, NCSR "Demokritos" Outline Part I: Getting to know Spark Part II: Basic programming Part III: Spark under

More information

Lambda Architecture with Apache Spark

Lambda Architecture with Apache Spark Lambda Architecture with Apache Spark Michael Hausenblas, Chief Data Engineer MapR First Galway Data Meetup, 2015-02-03 2015 MapR Technologies 2015 MapR Technologies 1 Polyglot Processing 2015 2014 MapR

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Submitted to: Dr. Sunnie Chung. Presented by: Sonal Deshmukh Jay Upadhyay

Submitted to: Dr. Sunnie Chung. Presented by: Sonal Deshmukh Jay Upadhyay Submitted to: Dr. Sunnie Chung Presented by: Sonal Deshmukh Jay Upadhyay Submitted to: Dr. Sunny Chung Presented by: Sonal Deshmukh Jay Upadhyay What is Apache Survey shows huge popularity spike for Apache

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

Big Data security, tools and tips to protect information assets

Big Data security, tools and tips to protect information assets Big Data security, tools and tips to protect information assets Eddie Garcia Chief Security Architect 1 A Big Data Revolution is Happening as We Speak Industrial Revolution Data Revolution 2 The Benefits

More information

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

Data processing in Apache Spark

Data processing in Apache Spark Data processing in Apache Spark Pelle Jakovits 5 October, 2015, Tartu Outline Introduction to Spark Resilient Distributed Datasets (RDD) Data operations RDD transformations Examples Fault tolerance Frameworks

More information

Data processing in Apache Spark

Data processing in Apache Spark Data processing in Apache Spark Pelle Jakovits 8 October, 2014, Tartu Outline Introduction to Spark Resilient Distributed Data (RDD) Available data operations Examples Advantages and Disadvantages Frameworks

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years

More information

New Developments in Spark

New Developments in Spark New Developments in Spark And Rethinking APIs for Big Data Matei Zaharia and many others What is Spark? Unified computing engine for big data apps > Batch, streaming and interactive Collection of high-level

More information

Stay Informed During and AEer OpenWorld

Stay Informed During and AEer OpenWorld Stay Informed During and AEer OpenWorld TwiIer: @OracleBigData, @OracleExadata, @Infrastructure Follow #CloudReady LinkedIn: Oracle IT Infrastructure Oracle Showcase Page Oracle Big Data Oracle Showcase

More information

Spark and distributed data processing

Spark and distributed data processing Stanford CS347 Guest Lecture Spark and distributed data processing Reynold Xin @rxin 2016-05-23 Who am I? Reynold Xin PMC member, Apache Spark Cofounder & Chief Architect, Databricks PhD on leave (ABD),

More information

CSE 444: Database Internals. Lecture 23 Spark

CSE 444: Database Internals. Lecture 23 Spark CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei

More information

Data processing in Apache Spark

Data processing in Apache Spark Data processing in Apache Spark Pelle Jakovits 21 October, 2015, Tartu Outline Introduction to Spark Resilient Distributed Datasets (RDD) Data operations RDD transformations Examples Fault tolerance Streaming

More information

MapR Enterprise Hadoop

MapR Enterprise Hadoop 2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS

More information

Apache Spark. CS240A T Yang. Some of them are based on P. Wendell s Spark slides

Apache Spark. CS240A T Yang. Some of them are based on P. Wendell s Spark slides Apache Spark CS240A T Yang Some of them are based on P. Wendell s Spark slides Parallel Processing using Spark+Hadoop Hadoop: Distributed file system that connects machines. Mapreduce: parallel programming

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Big Data Architect.

Big Data Architect. Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional

More information

An Overview of Apache Spark

An Overview of Apache Spark An Overview of Apache Spark CIS 612 Sunnie Chung 2014 MapR Technologies 1 MapReduce Processing Model MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using

More information

Apache Spark. Easy and Fast Big Data Analytics Pat McDonough

Apache Spark. Easy and Fast Big Data Analytics Pat McDonough Apache Spark Easy and Fast Big Data Analytics Pat McDonough Founded by the creators of Apache Spark out of UC Berkeley s AMPLab Fully committed to 100% open source Apache Spark Support and Grow the Spark

More information

A Tutorial on Apache Spark

A Tutorial on Apache Spark A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:

More information

DATA SCIENCE USING SPARK: AN INTRODUCTION

DATA SCIENCE USING SPARK: AN INTRODUCTION DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data

More information

Processing of big data with Apache Spark

Processing of big data with Apache Spark Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT

More information

OLTP on Hadoop: Reviewing the first Hadoop- based TPC- C benchmarks

OLTP on Hadoop: Reviewing the first Hadoop- based TPC- C benchmarks OLTP on Hadoop: Reviewing the first Hadoop- based TPC- C benchmarks Monte Zweben Co- Founder and Chief Execu6ve Officer John Leach Co- Founder and Chief Technology Officer September 30, 2015 The Tradi6onal

More information

MapReduce review. Spark and distributed data processing. Who am I? Today s Talk. Reynold Xin

MapReduce review. Spark and distributed data processing. Who am I? Today s Talk. Reynold Xin Who am I? Reynold Xin Stanford CS347 Guest Lecture Spark and distributed data processing PMC member, Apache Spark Cofounder & Chief Architect, Databricks PhD on leave (ABD), UC Berkeley AMPLab Reynold

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases

More information

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem

More information

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program

More information

Distributed Computing with Spark

Distributed Computing with Spark Distributed Computing with Spark Reza Zadeh Thanks to Matei Zaharia Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Numerical computing on Spark Ongoing

More information

Apache Spark 2.0. Matei

Apache Spark 2.0. Matei Apache Spark 2.0 Matei Zaharia @matei_zaharia What is Apache Spark? Open source data processing engine for clusters Generalizes MapReduce model Rich set of APIs and libraries In Scala, Java, Python and

More information

Backtesting with Spark

Backtesting with Spark Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel 1 Traditional Grid Shared storage Storage and compute scale independently Bottleneck on I/O

More information

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours) Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike F. Ringenburg, Prabhat

Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike F. Ringenburg, Prabhat H5Spark H5Spark: Bridging the I/O Gap between Spark and Scien9fic Data Formats on HPC Systems Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike

More information

Innovatus Technologies

Innovatus Technologies HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String

More information

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark Agenda Spark Platform Spark Core Spark Extensions Using Apache Spark About me Vitalii Bondarenko Data Platform Competency Manager Eleks www.eleks.com 20 years in software development 9+ years of developing

More information

Dell In-Memory Appliance for Cloudera Enterprise

Dell In-Memory Appliance for Cloudera Enterprise Dell In-Memory Appliance for Cloudera Enterprise Spark Technology Overview and Streaming Workload Use Cases Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

Big Data Analytics. Map-Reduce and Spark. Slides by: Joseph E. Gonzalez Guest Lecturer: Vikram Sreekanti

Big Data Analytics. Map-Reduce and Spark. Slides by: Joseph E. Gonzalez Guest Lecturer: Vikram Sreekanti Big Data Analytics Map-Reduce and Spark Slides by: Joseph E. Gonzalez jegonzal@cs.berkeley.edu? Guest Lecturer: Vikram Sreekanti From SQL to Big Data (with SQL) Ø A few weeks ago Ø Databases Ø (Relational)

More information

Big Data Infrastructures & Technologies Hadoop Streaming Revisit.

Big Data Infrastructures & Technologies Hadoop Streaming Revisit. Big Data Infrastructures & Technologies Hadoop Streaming Revisit ENRON Mapper ENRON Mapper Output (Excerpt) acomnes@enron.com blake.walker@enron.com edward.snowden@cia.gov alex.berenson@nyt.com ENRON Reducer

More information

Hadoop Development Introduction

Hadoop Development Introduction Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand

More information

Introduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent

Introduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent Introduc)on to Apache Ka1a Jun Rao Co- founder of Confluent Agenda Why people use Ka1a Technical overview of Ka1a What s coming What s Apache Ka1a Distributed, high throughput pub/sub system Ka1a Usage

More information

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F.

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F. Applied Spark From Concepts to Bitcoin Analytics Andrew F. Hart ahart@apache.org @andrewfhart My Day Job CTO, Pogoseat Upgrade technology for live events 3/28/16 QCON-SP Andrew Hart 2 Additionally Member,

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

Oracle Big Data Fundamentals Ed 1

Oracle Big Data Fundamentals Ed 1 Oracle University Contact Us: +0097143909050 Oracle Big Data Fundamentals Ed 1 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big Data

More information

Exam Questions

Exam Questions Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Analytics in Spark Yanlei Diao Tim Hunter Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Outline 1. A brief history of Big Data and Spark 2. Technical summary of Spark 3. Unified analytics

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Cloudera Kudu Introduction

Cloudera Kudu Introduction Cloudera Kudu Introduction Zbigniew Baranowski Based on: http://slideshare.net/cloudera/kudu-new-hadoop-storage-for-fast-analytics-onfast-data What is KUDU? New storage engine for structured data (tables)

More information

Evolution of the Logging Service Hands-on Hadoop Proof of Concept for CALS-2.0

Evolution of the Logging Service Hands-on Hadoop Proof of Concept for CALS-2.0 Evolution of the Logging Service Hands-on Hadoop Proof of Concept for CALS-2.0 Chris Roderick Marcin Sobieszek Piotr Sowinski Nikolay Tsvetkov Jakub Wozniak Courtesy IT-DB Agenda Intro to CALS System Hadoop

More information

Distributed Computing with Spark and MapReduce

Distributed Computing with Spark and MapReduce Distributed Computing with Spark and MapReduce Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How

More information

HBase... And Lewis Carroll! Twi:er,

HBase... And Lewis Carroll! Twi:er, HBase... And Lewis Carroll! jw4ean@cloudera.com Twi:er, LinkedIn: @jw4ean 1 Introduc@on 2010: Cloudera Solu@ons Architect 2011: Cloudera TAM/DSE 2012-2013: Cloudera Training focusing on Partners and Newbies

More information

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016 Khadija Souissi Auf z Systems 07. 08. November 2016 @ IBM z Systems Mainframe Event 2016 Acknowledgements Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.

More information

Time Series Storage with Apache Kudu (incubating)

Time Series Storage with Apache Kudu (incubating) Time Series Storage with Apache Kudu (incubating) Dan Burkert (Committer) dan@cloudera.com @danburkert Tweet about this talk: @getkudu or #kudu 1 Time Series machine metrics event logs sensor telemetry

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Configuring and Deploying Hadoop Cluster Deployment Templates

Configuring and Deploying Hadoop Cluster Deployment Templates Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018 Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/

More information

Turning Relational Database Tables into Spark Data Sources

Turning Relational Database Tables into Spark Data Sources Turning Relational Database Tables into Spark Data Sources Kuassi Mensah Jean de Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 3 Safe Harbor Statement The following

More information

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Raanan Dagan and Rohit Pujari September 25, 2017 Washington, DC Forward-Looking Statements During the course of this presentation, we may

More information

Hadoop course content

Hadoop course content course content COURSE DETAILS 1. In-detail explanation on the concepts of HDFS & MapReduce frameworks 2. What is 2.X Architecture & How to set up Cluster 3. How to write complex MapReduce Programs 4. In-detail

More information

Big data systems 12/8/17

Big data systems 12/8/17 Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores

More information

Security and Performance advances with Oracle Big Data SQL

Security and Performance advances with Oracle Big Data SQL Security and Performance advances with Oracle Big Data SQL Jean-Pierre Dijcks Oracle Redwood Shores, CA, USA Key Words SQL, Oracle, Database, Analytics, Object Store, Files, Big Data, Big Data SQL, Hadoop,

More information

Oracle Big Data Fundamentals Ed 2

Oracle Big Data Fundamentals Ed 2 Oracle University Contact Us: 1.800.529.0165 Oracle Big Data Fundamentals Ed 2 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, you learn about big data, the technologies

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Hortonworks and The Internet of Things

Hortonworks and The Internet of Things Hortonworks and The Internet of Things Dr. Bernhard Walter Solutions Engineer About Hortonworks Customer Momentum ~700 customers (as of November 4, 2015) 152 customers added in Q3 2015 Publicly traded

More information

MapReduce, Apache Hadoop

MapReduce, Apache Hadoop NDBI040: Big Data Management and NoSQL Databases hp://www.ksi.mff.cuni.cz/ svoboda/courses/2016-1-ndbi040/ Lecture 2 MapReduce, Apache Hadoop Marn Svoboda svoboda@ksi.mff.cuni.cz 11. 10. 2016 Charles University

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Resilient Distributed Datasets

Resilient Distributed Datasets Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Analyzing Flight Data

Analyzing Flight Data IBM Analytics Analyzing Flight Data Jeff Carlson Rich Tarro July 21, 2016 2016 IBM Corporation Agenda Spark Overview a quick review Introduction to Graph Processing and Spark GraphX GraphX Overview Demo

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

Bringing Data to Life

Bringing Data to Life Bringing Data to Life Data management and Visualization Techniques Benika Hall Rob Harrison Corporate Model Risk March 16, 2018 Introduction Benika Hall Analytic Consultant Wells Fargo - Corporate Model

More information

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?

More information

BigDataBench- S: An Open- source Scien6fic Big Data Benchmark Suite

BigDataBench- S: An Open- source Scien6fic Big Data Benchmark Suite BigDataBench- S: An Open- source Scien6fic Big Data Benchmark Suite Xinhui Tian, Shaopeng Dai, Zhihui Du, Wanling Gao, Rui Ren, Yaodong Cheng, Zhifei Zhang, Zhen Jia, Peijian Wang and Jianfeng Zhan INSTITUTE

More information

Specialist ICT Learning

Specialist ICT Learning Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.

More information

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction

More information